feat(nlp): #805 tokenizer rollout — embedding fix + tokenizers/tokenizer_id for all NLP models by shujaatTracebloc · Pull Request #96 · tracebloc/model-zoo

shujaatTracebloc · 2026-06-17T06:30:00Z

Brings every NLP model in the zoo into compliance with the federated tokenizer contract (#805): a run never starts without a tokenizer, so non-HF models must ship a tokenizer.json and HuggingFace models must declare a tokenizer_id.

1. Fix `simple_text.py` (the original bug)

SimpleTextClassifier had no nn.Embedding: forward() fed Long token IDs straight into nn.Linear, so upload smoke-training failed with "mat1 and mat2 must have the same dtype, but got Long and Float". The fix adds a real nn.Embedding(30522, embed_dim) front end, followed by attention-mask-aware mean pooling and the existing head; input_ids.long() covers the text handler's float-cast branch. This matches the sibling simple_token / simple_mlm models.
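
A minimal sketch of the forward path described above, assuming PyTorch. The class name, the embed_dim and hidden_dim defaults, and the ReLU between fc1 and fc2 are illustrative assumptions; this is not the code in simple_text.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleTextClassifierSketch(nn.Module):
    """Illustrative stand-in: embedding -> masked mean-pool -> fc1/fc2 head."""

    def __init__(self, vocab_size=30522, embed_dim=128, hidden_dim=64, output_classes=2):
        super().__init__()
        # Token IDs index into this table instead of being fed to nn.Linear.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_classes)

    def forward(self, input_ids, attention_mask=None):
        # An upstream handler may have cast the IDs to float; coerce back to Long.
        input_ids = input_ids.long()
        embedded = self.embedding(input_ids)  # (batch, seq, embed_dim)

        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        mask = attention_mask.unsqueeze(-1).to(embedded.dtype)  # (batch, seq, 1)

        # Mean over real tokens only: padded positions contribute zero, and
        # the clamp avoids dividing by zero for an all-padding row.
        pooled = (embedded * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

        return self.fc2(F.relu(self.fc1(pooled)))  # (batch, output_classes)
```

Because padded positions are zeroed before the sum, changing the token IDs under a zero mask leaves the logits unchanged, which is the property that "attention-mask-aware" pooling is meant to provide.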

2. HuggingFace models → `tokenizer_id` (17 files)

Added module-level tokenizer_id (= model_id, the matching tokenizer repo) to every HF model that exposes .config:

text_classification (9): bert_base_uncased(+_scratch), distilbert(+_scratch), electra, gemma_2, gte_modernbert, modernbert, phi_3_mini
token_classification (7): bert_token, deberta_v3_token, distil_token, electra_token, modernbert_token, roberta_token, xlm_roberta_token
masked_language_modeling (1): wide_mini_mlm (subclasses BertForMaskedLM)
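
As a rough illustration of the module-level declaration, the sketch below shows a model file that sets tokenizer_id equal to model_id, plus one way such a declaration could be read without importing the file. The helper read_module_string and the use of the ast module are assumptions made for this example; the PR does not say how the SDK parses the value.

```python
import ast

# What the top of an HF model file might look like (illustrative values).
MODEL_SOURCE = '''
model_id = "bert-base-uncased"
tokenizer_id = model_id
'''


def read_module_string(source, name):
    """Return the module-level string bound to `name`, or None.

    Hypothetical helper: handles plain string literals and simple aliases
    of names assigned earlier in the module, nothing more.
    """
    values = {}
    for node in ast.parse(source).body:
        if not isinstance(node, ast.Assign) or len(node.targets) != 1:
            continue
        target = node.targets[0]
        if not isinstance(target, ast.Name):
            continue
        if isinstance(node.value, ast.Constant) and isinstance(node.value.value, str):
            values[target.id] = node.value.value
        elif isinstance(node.value, ast.Name) and node.value.id in values:
            values[target.id] = values[node.value.id]
    return values.get(name)


print(read_module_string(MODEL_SOURCE, "tokenizer_id"))
```

Reading the value statically keeps the check cheap, since the model weights never need to be loaded just to learn which tokenizer a file declares.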

3. Custom (non-HF) models → `tokenizer.json` (real bert-base-uncased)

All non-HF NLP models use vocab_size=30522, so they ship the real bert-base-uncased tokenizer (special tokens [PAD]/[CLS]/[SEP]/[UNK]/[MASK]; max token id 30521, which fits the 30522-row embedding table):

masked_language_modeling/pytorch/tokenizer.json: shared, auto-detected. The whole dir is bert-vocab: 11 scratch nn.Modules (incl. netmedgpt_style_warmstart, an HF wrapper that does not expose .config, so the SDK treats it as non-HF) plus HF wide_mini (also bert). One sibling is correct for every model here and clobbers nothing.
simple_text_tokenizer.json / simple_token_tokenizer.json: distinct names, not auto-detected. Their dirs also contain HF models with different tokenizers (roberta, xlm-roberta, deberta-v3, modernbert, gemma, phi…), and the SDK auto-attaches any sibling tokenizer.json to every model in the dir, which would silently override those models' tokenizer_id. Distinct naming avoids that; upload explicitly: user.upload_model("simple_text", tokenizer="simple_text_tokenizer.json").
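
The naming rule can be sketched as follows. The helper auto_detected_tokenizer is hypothetical: it models the behaviour described above (only a sibling named exactly tokenizer.json is picked up) and is not taken from the SDK.

```python
from pathlib import Path


def auto_detected_tokenizer(model_file):
    """Return the sibling tokenizer.json that would be auto-attached, or None.

    Hypothetical model of the described rule: a file named exactly
    "tokenizer.json" in the model's directory applies to every model there;
    any other file name is ignored unless passed explicitly at upload.
    """
    candidate = Path(model_file).parent / "tokenizer.json"
    return candidate if candidate.is_file() else None
```

Under this rule, a directory holding only simple_text_tokenizer.json yields None for every model in it, so the HF models there keep their declared tokenizer_id and the custom model gets its tokenizer through the explicit tokenizer= argument.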

Documents the convention in CLAUDE.md.

Verification (feature/748 SDK branch)

All 17 HF files parse a tokenizer_id.
All 13 non-HF models are classified as non-HF, rejected without a tokenizer, and accepted with the shipped one; special-token + embedding-fit checks pass.
netmedgpt_style_warmstart wrapper confirmed non-HF (.config absent) and satisfied by the shared tokenizer.json.
Full forward → loss → backward → optimizer.step() smoke-training passes for simple_mlm, simple_token, and simple_text through the real upload handlers.
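
The special-token and embedding-fit checks could be approximated as in the sketch below. It assumes the Hugging Face tokenizers JSON layout used by WordPiece and WordLevel models, where model.vocab maps each token to its id and added_tokens is a list of entries with content and id. The function names, the required-token tuple, and the message wording are assumptions, not the SDK's actual validation.

```python
import json

# Assumed required set, taken from the tokens listed in section 3.
REQUIRED_SPECIAL_TOKENS = ("[PAD]", "[CLS]", "[SEP]", "[UNK]", "[MASK]")


def check_tokenizer(tokenizer, embedding_rows):
    """Return a list of problems for a parsed tokenizer.json dict."""
    vocab = tokenizer["model"]["vocab"]          # token -> id
    added = tokenizer.get("added_tokens", [])    # [{"id": ..., "content": ...}]

    known = set(vocab) | {tok["content"] for tok in added}
    problems = [
        f"missing special token {tok}"
        for tok in REQUIRED_SPECIAL_TOKENS
        if tok not in known
    ]

    # Every id must be a valid row index: max id <= embedding_rows - 1.
    max_id = max(list(vocab.values()) + [tok["id"] for tok in added])
    if max_id >= embedding_rows:
        problems.append(
            f"max token id {max_id} does not fit an embedding table "
            f"with {embedding_rows} rows"
        )
    return problems


def check_tokenizer_file(path, embedding_rows=30522):
    """Load a tokenizer.json from disk and run the checks above."""
    with open(path, encoding="utf-8") as handle:
        return check_tokenizer(json.load(handle), embedding_rows)
```

For the shipped bert-base-uncased tokenizer, the PR description gives a max id of 30521 against a 30522-row table, so a check of this shape would report no embedding-fit problem.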

Targets develop per repo convention.

🤖 Generated with Claude Code

Note

Medium Risk
Touches every NLP upload path and tokenizer distribution to clients; misnamed or shared tokenizer.json files could silently override HF tokenizer_id in mixed directories.

Overview
Rolls out issue #805 so federated NLP runs always have a declared tokenizer: HuggingFace zoo models get module-level tokenizer_id, and custom nn.Module models ship a tokenizer.json (with naming rules when they share a directory with HF models).

simple_text is fixed so token IDs go through nn.Embedding, attention-mask-aware mean pooling, and the existing head—resolving upload smoke-training dtype errors when Long IDs hit nn.Linear.

CLAUDE.md documents HF vs custom rules, required special tokens, and how sibling tokenizer.json auto-detection can override tokenizer_id in mixed directories.

Reviewed by Cursor Bugbot for commit 4ff10d9. Bugbot is set up for automated code reviews on this repo.

SimpleTextClassifier had no nn.Embedding: forward() flattened the input and fed it straight into nn.Linear. But NLP inputs are integer token IDs (Long) from the tokenizer while Linear weights are Float, so upload smoke-training failed with "mat1 and mat2 must have the same dtype, but got Long and Float". The "embeddings are already handled by the tokenizer" comment was a misconception: tokenizers emit IDs, not embeddings.

Add an nn.Embedding(vocab_size=30522, embed_dim) front end matching the sibling non-HF NLP models (simple_token / simple_mlm), embed input_ids, attention-mask-aware mean-pool over the sequence, then the existing fc1/fc2 head. forward() coerces input_ids to Long so the text-classifier upload handler's float-dtype cast (loss-less branch) is handled. Output stays 2D (batch, output_classes).

Ship a WordLevel tokenizer.json with [PAD]/[CLS]/[SEP]/[UNK] alongside the model: non-HF NLP models must distribute a tokenizer (#805).

Verified end-to-end through TorchTextClassifier.small_training_loop on the feature/748 SDK branch: structural + #805 tokenizer checks, output-shape probe, declared-seq-len probe, and forward->loss->backward->step all pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LukasWodka · 2026-06-17T06:30:49Z

👋 Heads-up — Code review queue is at 20 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

.github#57 — fix(fr-gate): pass items at or beyond the required stage · author: @aptracebloc · no reviewer assigned
backend#813 — test(#805): contract test for tokenizer fingerprint storage (Task 2) · author: @shujaatTracebloc · reviewer: @aptracebloc
backend#815 — chore(deps): bump cryptography from 47.0.0 to 48.0.1 · author: @dependabot · no reviewer assigned
cli#78 — fix(dataset rm): delete staging files from a uid-65532 pod, not jobs-manager (#259) · author: @LukasWodka · no reviewer assigned
cli#79 — chore(schema): re-sync vendored ingest.v1.json from data-ingestors master · author: @LukasWodka · no reviewer assigned
client#261 — feat(installer): fail fast when HOST_DATA_DIR is on a network filesystem · author: @saadqbal · no reviewer assigned
client#262 — feat(installer,chart): place datasets on a network mount while MySQL stays local · author: @saadqbal · no reviewer assigned
client-runtime#108 — fix(authz): match ingest table prefixes at a segment boundary (close cross-tenant straddle) · author: @LukasWodka · no reviewer assigned
client-runtime#114 — fix(jobs): cap training Job backoffLimit to stop crashloops starving the cluster · author: @saadqbal · no reviewer assigned
client-runtime#115 — feat(ingestion): run the ingestion pod as the host uid for datasets on NFS · author: @saadqbal · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

A federated NLP run never starts without a tokenizer (#805): non-HF models must ship a tokenizer.json; HuggingFace models must declare a tokenizer_id. Apply that across text_classification, token_classification and masked_language_modeling so every NLP model passes upload validation.

HuggingFace models (expose .config): add module-level tokenizer_id (= model_id, the matching tokenizer repo):
- text_classification: bert_base_uncased(+_scratch), distilbert(+_scratch), electra, gemma_2, gte_modernbert, modernbert, phi_3_mini
- token_classification: bert_token, deberta_v3_token, distil_token, electra_token, modernbert_token, roberta_token, xlm_roberta_token
- masked_language_modeling: wide_mini_mlm (subclasses BertForMaskedLM)

Custom (non-HF) models: ship a real bert-base-uncased tokenizer.json (vocab 30522, matching their embedding tables; [PAD]/[CLS]/[SEP]/[UNK]/[MASK]):
- masked_language_modeling/pytorch/tokenizer.json: shared, auto-detected. The whole dir is bert-vocab (11 scratch nn.Modules incl. the netmedgpt_warmstart HF-wrapper, which does NOT expose .config so the SDK treats it non-HF; plus HF wide_mini, also bert), so one sibling is correct for every model and never clobbers a different tokenizer.
- simple_text_tokenizer.json / simple_token_tokenizer.json: distinct names (NOT auto-detected). Their dirs also hold HF models with different tokenizers, so a bare tokenizer.json sibling would override those models' tokenizer_id. Upload explicitly via user.upload_model(name, tokenizer="<name>_tokenizer.json").

Replaces the placeholder simple_text tokenizer.json from the previous commit with the real bert tokenizer under the distinct name. Documents the tokenizer convention in CLAUDE.md.

Verified on the feature/748 SDK branch: all 17 HF files parse a tokenizer_id; all 13 non-HF models are classified non-HF, are rejected without a tokenizer and accepted with the shipped one, and the special tokens + embedding-fit checks pass. Full forward->loss->backward->step smoke-training passes for simple_mlm, simple_token and simple_text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The rewritten SimpleTextClassifier uses only nn and F; the bare `import torch` is no longer referenced and tripped the CI ruff check (F401). The torch.nn / torch.nn.functional submodule imports still load torch, so the model loads and trains unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shujaatTracebloc changed the title ~~fix(text): give SimpleTextClassifier a real token-embedding front end~~ feat(nlp): #805 tokenizer rollout — embedding fix + tokenizers/tokenizer_id for all NLP models Jun 17, 2026

shujaatTracebloc requested review from aptracebloc and divyasinghds June 17, 2026 07:53

divyasinghds approved these changes Jun 17, 2026

shujaatTracebloc merged commit 65f74d5 into develop Jun 17, 2026
6 checks passed

shujaatTracebloc deleted the fix/simple-text-embedding-frontend branch June 17, 2026 07:55

shujaatTracebloc merged 3 commits into develop from fix/simple-text-embedding-frontend
