feat(nlp): #805 tokenizer rollout — embedding fix + tokenizers/tokenizer_id for all NLP models#96

Merged
shujaatTracebloc merged 3 commits into develop from fix/simple-text-embedding-frontend on Jun 17, 2026

@shujaatTracebloc shujaatTracebloc commented Jun 17, 2026

Brings every NLP model in the zoo into compliance with the federated tokenizer contract (#805): a run never starts without a tokenizer, so non-HF models must ship a tokenizer.json and HuggingFace models must declare a tokenizer_id.

1. Fix simple_text.py (the original bug)

SimpleTextClassifier had no nn.Embedding: forward() fed Long token IDs straight into nn.Linear, so upload smoke-training failed with "mat1 and mat2 must have the same dtype, but got Long and Float". The fix adds a real nn.Embedding(30522, embed_dim) front end and attention-mask-aware mean pooling in front of the existing head; input_ids.long() handles the text handler's float-cast branch. This matches the sibling simple_token / simple_mlm models.
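The shape of that fix can be sketched roughly as follows. This is an illustrative stand-in, not the code in simple_text.py: the class name, `hidden_dim`, the default `embed_dim`, and the all-ones fallback mask are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleTextClassifierSketch(nn.Module):
    """Illustrative stand-in for the fixed model; layer sizes are assumptions."""

    def __init__(self, vocab_size=30522, embed_dim=64, hidden_dim=32, output_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_classes)

    def forward(self, input_ids, attention_mask=None):
        # Token IDs may arrive float-cast; nn.Embedding needs integer indices.
        input_ids = input_ids.long()
        embedded = self.embedding(input_ids)  # (batch, seq, embed_dim)
        if attention_mask is None:
            # Assumption for this sketch: no mask means every position is real.
            attention_mask = torch.ones_like(input_ids)
        mask = attention_mask.to(embedded.dtype).unsqueeze(-1)  # (batch, seq, 1)
        summed = (embedded * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1.0)  # avoid dividing by zero
        pooled = summed / counts  # mean over unmasked tokens only
        return self.fc2(F.relu(self.fc1(pooled)))  # (batch, output_classes)


model = SimpleTextClassifierSketch()
ids = torch.tensor([[101, 7592, 102, 0], [101, 102, 0, 0]])
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
logits = model(ids.float(), mask)  # float-cast IDs are coerced back to Long
print(tuple(logits.shape))
```

Because the mask zeroes out padded positions before averaging, the pooled vector does not depend on whatever IDs sit in the padding slots.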

2. HuggingFace models → tokenizer_id (17 files)

Added module-level tokenizer_id (= model_id, the matching tokenizer repo) to every HF model that exposes .config:

  • text_classification (9): bert_base_uncased(+_scratch), distilbert(+_scratch), electra, gemma_2, gte_modernbert, modernbert, phi_3_mini
  • token_classification (7): bert_token, deberta_v3_token, distil_token, electra_token, modernbert_token, roberta_token, xlm_roberta_token
  • masked_language_modeling (1): wide_mini_mlm (subclasses BertForMaskedLM)
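As a rough illustration of what "parsing a tokenizer_id" from a model file could look like, the sketch below reads a module-level string assignment with the standard-library `ast` module, without importing the model. The helper name `read_tokenizer_id` and the sample module text are hypothetical, and the SDK's actual parser may work differently.

```python
import ast


def read_tokenizer_id(source: str):
    """Return the module-level tokenizer_id string, or None if it is absent.

    Hypothetical helper: it resolves plain string assignments and simple
    name-to-name aliases such as `tokenizer_id = model_id`.
    """
    constants = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        if isinstance(value, ast.Constant) and isinstance(value.value, str):
            resolved = value.value
        elif isinstance(value, ast.Name) and value.id in constants:
            resolved = constants[value.id]
        else:
            continue
        for target in node.targets:
            if isinstance(target, ast.Name):
                constants[target.id] = resolved
    return constants.get("tokenizer_id")


# Sample module text, written for this example only.
example_module = '''
model_id = "bert-base-uncased"
tokenizer_id = model_id
'''
print(read_tokenizer_id(example_module))  # prints: bert-base-uncased
```

Reading the value statically keeps the check cheap: no model weights are loaded just to learn which tokenizer a file declares.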

3. Custom (non-HF) models → tokenizer.json (real bert-base-uncased)

All non-HF NLP models use vocab_size=30522, so they ship the real bert-base-uncased tokenizer (special tokens [PAD]/[CLS]/[SEP]/[UNK]/[MASK]; its largest id, 30521, fits inside the 30522-row embedding table):

  • masked_language_modeling/pytorch/tokenizer.json: shared, auto-detected. Every model in this directory uses the bert vocab: 11 scratch nn.Modules (incl. netmedgpt_style_warmstart, an HF wrapper that does not expose .config, so the SDK treats it as non-HF) plus the HF wide_mini model (also bert). A single sibling file is therefore correct for every model here and overrides no other tokenizer.
  • simple_text_tokenizer.json / simple_token_tokenizer.json: distinct names, not auto-detected. Their directories also contain HF models with different tokenizers (roberta, xlm-roberta, deberta-v3, modernbert, gemma, phi…), and the SDK auto-attaches any sibling tokenizer.json to every model in the directory, which would silently override those models' tokenizer_id. Distinct naming avoids that; upload explicitly: user.upload_model("simple_text", tokenizer="simple_text_tokenizer.json").
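The special-token and embedding-fit requirements above can be sketched as a small check over a tokenizer.json document. The helper name `check_tokenizer_fits` and the toy vocabulary are hypothetical; the sketch assumes the Hugging Face `tokenizers` JSON layout for WordPiece/WordLevel models, where `model.vocab` maps each token to its id and `added_tokens` lists extra entries with `content` and `id`. It is not the SDK's validator.

```python
import json

REQUIRED_SPECIAL_TOKENS = ("[PAD]", "[CLS]", "[SEP]", "[UNK]", "[MASK]")


def check_tokenizer_fits(tokenizer_json: str, embedding_rows: int):
    """Return a list of problems; an empty list means the tokenizer fits."""
    data = json.loads(tokenizer_json)
    # Token -> id mapping (WordPiece / WordLevel layout assumed).
    vocab = dict(data.get("model", {}).get("vocab", {}))
    for added in data.get("added_tokens", []):
        vocab.setdefault(added["content"], added["id"])

    problems = []
    missing = [tok for tok in REQUIRED_SPECIAL_TOKENS if tok not in vocab]
    if missing:
        problems.append("missing special tokens: " + ", ".join(missing))
    max_id = max(vocab.values(), default=-1)
    if max_id >= embedding_rows:
        problems.append(
            f"largest token id {max_id} does not fit an embedding table "
            f"with {embedding_rows} rows"
        )
    return problems


# Toy tokenizer document, standing in for the real 30522-entry file.
toy_tokenizer = json.dumps({
    "added_tokens": [{"id": 4, "content": "[MASK]", "special": True}],
    "model": {
        "type": "WordPiece",
        "vocab": {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4, "hello": 5},
    },
})
fits = check_tokenizer_fits(toy_tokenizer, embedding_rows=6)
too_small = check_tokenizer_fits(toy_tokenizer, embedding_rows=5)
print(fits)
print(too_small)
```

For the shipped bert-base-uncased file the same comparison is 30521 against 30522 rows, which is why one tokenizer can serve every vocab_size=30522 model.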

Documents the convention in CLAUDE.md.

Verification (feature/748 SDK branch)

  • All 17 HF files parse a tokenizer_id.
  • All 13 non-HF models are classified non-HF, rejected without a tokenizer, accepted with the shipped one; special-token + embedding-fit checks pass.
  • netmedgpt_style_warmstart wrapper confirmed non-HF (.config absent) and satisfied by the shared tokenizer.json.
  • Full forward → loss → backward → optimizer.step() smoke-training passes for simple_mlm, simple_token, and simple_text through the real upload handlers.
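For readers unfamiliar with the term, one forward → loss → backward → optimizer.step() smoke-training step looks roughly like the sketch below. `TinyTextModel`, the batch shape, and the SGD settings are assumptions for illustration; the real checks run through the SDK's upload handlers, which are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)


class TinyTextModel(nn.Module):
    """Minimal embedding + linear head, used only for this sketch."""

    def __init__(self, vocab_size=30522, embed_dim=16, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, input_ids):
        # Plain (unmasked) mean over the sequence keeps the sketch short.
        return self.head(self.embedding(input_ids.long()).mean(dim=1))


model = TinyTextModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
input_ids = torch.randint(0, 30522, (4, 12))  # random token IDs
labels = torch.randint(0, 2, (4,))            # random class labels

weights_before = model.head.weight.detach().clone()
optimizer.zero_grad()
logits = model(input_ids)                # forward
loss = F.cross_entropy(logits, labels)   # loss
loss.backward()                          # backward
optimizer.step()                         # optimizer.step()

weights_changed = not torch.equal(weights_before, model.head.weight)
print(bool(torch.isfinite(loss)), weights_changed)
```

A step like this does not measure model quality; it only confirms that dtypes, shapes, and gradients line up well enough for training to begin.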

Targets develop per repo convention.

🤖 Generated with Claude Code


Note

Medium Risk
Touches every NLP upload path and tokenizer distribution to clients; misnamed or shared tokenizer.json files could silently override HF tokenizer_id in mixed directories.

Overview
Rolls out issue #805 so federated NLP runs always have a declared tokenizer: HuggingFace zoo models get module-level tokenizer_id, and custom nn.Module models ship a tokenizer.json (with naming rules when they share a directory with HF models).

simple_text is fixed so token IDs go through nn.Embedding, attention-mask-aware mean pooling, and the existing head—resolving upload smoke-training dtype errors when Long IDs hit nn.Linear.

CLAUDE.md documents HF vs custom rules, required special tokens, and how sibling tokenizer.json auto-detection can override tokenizer_id in mixed directories.

Reviewed by Cursor Bugbot for commit 4ff10d9.

SimpleTextClassifier had no nn.Embedding: forward() flattened the input
and fed it straight into nn.Linear. But NLP inputs are integer token IDs
(Long) from the tokenizer while Linear weights are Float, so upload
smoke-training failed with "mat1 and mat2 must have the same dtype, but
got Long and Float". The "embeddings are already handled by the
tokenizer" comment was a misconception — tokenizers emit IDs, not
embeddings.
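A minimal way to see the dtype mismatch outside the SDK, assuming arbitrary small sizes chosen for this sketch (the exact error text can vary between PyTorch versions):

```python
import torch
import torch.nn as nn

seq_len, embed_dim = 8, 16
token_ids = torch.randint(0, 30522, (2, seq_len))  # Long IDs, as a tokenizer emits

# Feeding integer IDs straight into nn.Linear fails: its weights are Float.
linear = nn.Linear(seq_len, 4)
try:
    linear(token_ids)
    linear_failed = False
except RuntimeError as err:
    linear_failed = True
    print("nn.Linear rejected Long input:", err)

# nn.Embedding is the layer that turns integer IDs into Float vectors.
embedding = nn.Embedding(30522, embed_dim)
vectors = embedding(token_ids)
print(vectors.dtype, tuple(vectors.shape))
```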

Add an nn.Embedding(vocab_size=30522, embed_dim) front end matching the
sibling non-HF NLP models (simple_token / simple_mlm), embed input_ids,
attention-mask-aware mean-pool over the sequence, then the existing
fc1/fc2 head. forward() coerces input_ids to Long so the text-classifier
upload handler's float-dtype cast (loss-less branch) is handled. Output
stays 2D (batch, output_classes).

Ship a WordLevel tokenizer.json with [PAD]/[CLS]/[SEP]/[UNK] alongside
the model — non-HF NLP models must distribute a tokenizer (#805).

Verified end-to-end through TorchTextClassifier.small_training_loop on
the feature/748 SDK branch: structural + #805 tokenizer checks,
output-shape probe, declared-seq-len probe, and forward->loss->backward
->step all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@LukasWodka

👋 Heads-up — Code review queue is at 20 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

  • .github#57 — fix(fr-gate): pass items at or beyond the required stage · author: @aptracebloc · no reviewer assigned
  • backend#813 — test(#805): contract test for tokenizer fingerprint storage (Task 2) · author: @shujaatTracebloc · reviewer: @aptracebloc
  • backend#815 — chore(deps): bump cryptography from 47.0.0 to 48.0.1 · author: @dependabot · no reviewer assigned
  • cli#78 — fix(dataset rm): delete staging files from a uid-65532 pod, not jobs-manager (#259) · author: @LukasWodka · no reviewer assigned
  • cli#79 — chore(schema): re-sync vendored ingest.v1.json from data-ingestors master · author: @LukasWodka · no reviewer assigned
  • client#261 — feat(installer): fail fast when HOST_DATA_DIR is on a network filesystem · author: @saadqbal · no reviewer assigned
  • client#262 — feat(installer,chart): place datasets on a network mount while MySQL stays local · author: @saadqbal · no reviewer assigned
  • client-runtime#108 — fix(authz): match ingest table prefixes at a segment boundary (close cross-tenant straddle) · author: @LukasWodka · no reviewer assigned
  • client-runtime#114 — fix(jobs): cap training Job backoffLimit to stop crashloops starving the cluster · author: @saadqbal · no reviewer assigned
  • client-runtime#115 — feat(ingestion): run the ingestion pod as the host uid for datasets on NFS · author: @saadqbal · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

A federated NLP run never starts without a tokenizer (#805): non-HF
models must ship a tokenizer.json; HuggingFace models must declare a
tokenizer_id. Apply that across text_classification, token_classification
and masked_language_modeling so every NLP model passes upload validation.

HuggingFace models (expose .config) — add module-level tokenizer_id
(= model_id, the matching tokenizer repo):
  - text_classification: bert_base_uncased(+_scratch), distilbert(+_scratch),
    electra, gemma_2, gte_modernbert, modernbert, phi_3_mini
  - token_classification: bert_token, deberta_v3_token, distil_token,
    electra_token, modernbert_token, roberta_token, xlm_roberta_token
  - masked_language_modeling: wide_mini_mlm (subclasses BertForMaskedLM)

Custom (non-HF) models — ship a real bert-base-uncased tokenizer.json
(vocab 30522, matching their embedding tables; [PAD]/[CLS]/[SEP]/[UNK]/
[MASK]):
  - masked_language_modeling/pytorch/tokenizer.json — shared, auto-detected.
    The whole dir is bert-vocab (11 scratch nn.Modules incl. the
    netmedgpt_warmstart HF-wrapper, which does NOT expose .config so the
    SDK treats it non-HF; plus HF wide_mini, also bert), so one sibling is
    correct for every model and never clobbers a different tokenizer.
  - simple_text_tokenizer.json / simple_token_tokenizer.json — distinct
    names (NOT auto-detected): their dirs also hold HF models with
    different tokenizers, so a bare tokenizer.json sibling would override
    those models' tokenizer_id. Upload explicitly via
    user.upload_model(name, tokenizer="<name>_tokenizer.json").
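The naming rule is easier to follow with a toy model of the lookup. The function `resolve_tokenizer`, its precedence order, and the `roberta-base` id below are assumptions made for illustration; only the behaviour described above (a bare sibling tokenizer.json is attached to every model in the directory and wins over tokenizer_id) comes from this change.

```python
import tempfile
from pathlib import Path


def resolve_tokenizer(model_path, tokenizer_id=None, explicit_tokenizer=None):
    """Hypothetical lookup order: explicit file, sibling tokenizer.json, tokenizer_id."""
    model_path = Path(model_path)
    if explicit_tokenizer is not None:
        return ("file", str(explicit_tokenizer))
    sibling = model_path.parent / "tokenizer.json"
    if sibling.is_file():
        # Auto-detected sibling: applies to every model in the directory.
        return ("file", sibling.name)
    if tokenizer_id is not None:
        return ("hub", tokenizer_id)
    raise ValueError(f"{model_path.name}: no tokenizer, the run must not start")


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    hf_model = folder / "roberta_token.py"
    custom_model = folder / "simple_token.py"
    for path in (hf_model, custom_model, folder / "simple_token_tokenizer.json"):
        path.touch()

    # Distinctly named file: the HF model keeps its declared tokenizer_id.
    before = resolve_tokenizer(hf_model, tokenizer_id="roberta-base")
    explicit = resolve_tokenizer(
        custom_model, explicit_tokenizer="simple_token_tokenizer.json"
    )

    # A bare tokenizer.json sibling would now win over tokenizer_id.
    (folder / "tokenizer.json").touch()
    after = resolve_tokenizer(hf_model, tokenizer_id="roberta-base")

print(before)
print(explicit)
print(after)
```

In the mixed directory the HF model resolves to its hub id until a bare tokenizer.json appears, at which point the file takes over. That silent switch is what the distinct file names are meant to prevent.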

Replaces the placeholder simple_text tokenizer.json from the previous
commit with the real bert tokenizer under the distinct name. Documents
the tokenizer convention in CLAUDE.md.

Verified on the feature/748 SDK branch: all 17 HF files parse a
tokenizer_id; all 13 non-HF models are classified non-HF, are rejected
without a tokenizer and accepted with the shipped one, and the special
tokens + embedding-fit checks pass. Full forward->loss->backward->step
smoke-training passes for simple_mlm, simple_token and simple_text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shujaatTracebloc shujaatTracebloc changed the title from "fix(text): give SimpleTextClassifier a real token-embedding front end" to "feat(nlp): #805 tokenizer rollout — embedding fix + tokenizers/tokenizer_id for all NLP models" on Jun 17, 2026
The rewritten SimpleTextClassifier uses only nn and F; the bare
`import torch` is no longer referenced and tripped the CI ruff check
(F401). The torch.nn / torch.nn.functional submodule imports still load
torch, so the model loads and trains unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shujaatTracebloc shujaatTracebloc merged commit 65f74d5 into develop Jun 17, 2026
6 checks passed
@shujaatTracebloc shujaatTracebloc deleted the fix/simple-text-embedding-frontend branch June 17, 2026 07:55