feat(nlp): #805 tokenizer rollout — embedding fix + tokenizers/tokenizer_id for all NLP models#96
Merged
Conversation
SimpleTextClassifier had no nn.Embedding: forward() flattened the input and fed it straight into nn.Linear. But NLP inputs are integer token IDs (Long) from the tokenizer while Linear weights are Float, so upload smoke-training failed with "mat1 and mat2 must have the same dtype, but got Long and Float". The "embeddings are already handled by the tokenizer" comment was a misconception — tokenizers emit IDs, not embeddings. Add an nn.Embedding(vocab_size=30522, embed_dim) front end matching the sibling non-HF NLP models (simple_token / simple_mlm), embed input_ids, attention-mask-aware mean-pool over the sequence, then the existing fc1/fc2 head. forward() coerces input_ids to Long so the text-classifier upload handler's float-dtype cast (loss-less branch) is handled. Output stays 2D (batch, output_classes). Ship a WordLevel tokenizer.json with [PAD]/[CLS]/[SEP]/[UNK] alongside the model — non-HF NLP models must distribute a tokenizer (#805). Verified end-to-end through TorchTextClassifier.small_training_loop on the feature/748 SDK branch: structural + #805 tokenizer checks, output-shape probe, declared-seq-len probe, and forward->loss->backward ->step all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
|
👋 Heads-up — Code review queue is at 20 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
A federated NLP run never starts without a tokenizer (#805): non-HF
models must ship a tokenizer.json; HuggingFace models must declare a
tokenizer_id. Apply that across text_classification, token_classification
and masked_language_modeling so every NLP model passes upload validation.
HuggingFace models (expose .config) — add module-level tokenizer_id
(= model_id, the matching tokenizer repo):
- text_classification: bert_base_uncased(+_scratch), distilbert(+_scratch),
electra, gemma_2, gte_modernbert, modernbert, phi_3_mini
- token_classification: bert_token, deberta_v3_token, distil_token,
electra_token, modernbert_token, roberta_token, xlm_roberta_token
- masked_language_modeling: wide_mini_mlm (subclasses BertForMaskedLM)
Custom (non-HF) models — ship a real bert-base-uncased tokenizer.json
(vocab 30522, matching their embedding tables; [PAD]/[CLS]/[SEP]/[UNK]/
[MASK]):
- masked_language_modeling/pytorch/tokenizer.json — shared, auto-detected.
The whole dir is bert-vocab (11 scratch nn.Modules incl. the
netmedgpt_warmstart HF-wrapper, which does NOT expose .config so the
SDK treats it non-HF; plus HF wide_mini, also bert), so one sibling is
correct for every model and never clobbers a different tokenizer.
- simple_text_tokenizer.json / simple_token_tokenizer.json — distinct
names (NOT auto-detected): their dirs also hold HF models with
different tokenizers, so a bare tokenizer.json sibling would override
those models' tokenizer_id. Upload explicitly via
user.upload_model(name, tokenizer="<name>_tokenizer.json").
Replaces the placeholder simple_text tokenizer.json from the previous
commit with the real bert tokenizer under the distinct name. Documents
the tokenizer convention in CLAUDE.md.
Verified on the feature/748 SDK branch: all 17 HF files parse a
tokenizer_id; all 13 non-HF models are classified non-HF, are rejected
without a tokenizer and accepted with the shipped one, and the special
tokens + embedding-fit checks pass. Full forward->loss->backward->step
smoke-training passes for simple_mlm, simple_token and simple_text.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The rewritten SimpleTextClassifier uses only nn and F; the bare `import torch` is no longer referenced and tripped the CI ruff check (F401). The torch.nn / torch.nn.functional submodule imports still load torch, so the model loads and trains unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
divyasinghds
approved these changes
Jun 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Brings every NLP model in the zoo into compliance with the federated tokenizer contract (#805): a run never starts without a tokenizer, so non-HF models must ship a
tokenizer.jsonand HuggingFace models must declare atokenizer_id.1. Fix
simple_text.py(the original bug)SimpleTextClassifierhad nonn.Embedding—forward()fed Long token IDs straight intonn.Linear, so upload smoke-training failed withmat1 and mat2 must have the same dtype, but got Long and Float. Added a realnn.Embedding(30522, embed_dim)front end + attention-mask-aware mean-pool + the existing head;input_ids.long()handles the text handler's float-cast branch. Matches the siblingsimple_token/simple_mlm.2. HuggingFace models →
tokenizer_id(17 files)Added module-level
tokenizer_id(=model_id, the matching tokenizer repo) to every HF model that exposes.config:BertForMaskedLM)3. Custom (non-HF) models →
tokenizer.json(real bert-base-uncased)All non-HF NLP models use
vocab_size=30522, so they ship the real bert-base-uncased tokenizer (special tokens[PAD]/[CLS]/[SEP]/[UNK]/[MASK], max id 30521 < embedding table):masked_language_modeling/pytorch/tokenizer.json— shared, auto-detected. The whole dir is bert-vocab: 11 scratchnn.Modules (incl.netmedgpt_style_warmstart, an HF wrapper that does not expose.config, so the SDK treats it non-HF) plus HFwide_mini(also bert). One sibling is correct for every model here and clobbers nothing.simple_text_tokenizer.json/simple_token_tokenizer.json— distinct names, not auto-detected. Their dirs also contain HF models with different tokenizers (roberta, xlm-roberta, deberta-v3, modernbert, gemma, phi…), and the SDK auto-attaches any siblingtokenizer.jsonto every model in the dir — which would silently override those models'tokenizer_id. Distinct naming avoids that; upload explicitly:user.upload_model("simple_text", tokenizer="simple_text_tokenizer.json").Documents the convention in CLAUDE.md.
Verification (feature/748 SDK branch)
tokenizer_id.netmedgpt_style_warmstartwrapper confirmed non-HF (.configabsent) and satisfied by the sharedtokenizer.json.forward → loss → backward → optimizer.step()smoke-training passes forsimple_mlm,simple_token, andsimple_textthrough the real upload handlers.Targets
developper repo convention.🤖 Generated with Claude Code
Note
Medium Risk
Touches every NLP upload path and tokenizer distribution to clients; misnamed or shared
tokenizer.jsonfiles could silently override HFtokenizer_idin mixed directories.Overview
Rolls out issue #805 so federated NLP runs always have a declared tokenizer: HuggingFace zoo models get module-level
tokenizer_id, and customnn.Modulemodels ship atokenizer.json(with naming rules when they share a directory with HF models).simple_textis fixed so token IDs go throughnn.Embedding, attention-mask-aware mean pooling, and the existing head—resolving upload smoke-training dtype errors when Long IDs hitnn.Linear.CLAUDE.mddocuments HF vs custom rules, required special tokens, and how siblingtokenizer.jsonauto-detection can overridetokenizer_idin mixed directories.Reviewed by Cursor Bugbot for commit 4ff10d9. Bugbot is set up for automated code reviews on this repo. Configure here.