feat(#805): register NLP tokenizer fingerprint at ingest (Task 2) #281
Merged
shujaatTracebloc merged 4 commits on Jun 17, 2026
Extract a shipped tokenizer.json's 4 structural integers (vocab_size, mask_token_id, pad_token_id, tokenizer_type) at ingest and ship them on the existing global-metadata channel (send_global_meta_meta), so the backend can cross-check a contributor tokenizer at dataset linking (Task 4).
FL guardrail: ONLY these 4 integers cross the cluster boundary, never vocabulary content and never a hash (a custom clinical/KG tokenizer's vocab is data-derived and must not be centrally fingerprinted).
- modalities/spec.py + registry.py: new `is_nlp` ModalitySpec flag (mirrors `is_self_supervised`); derive `NLP_CATEGORIES` from it (single source, can't drift). Exported from modalities/__init__.py.
- validators/tokenizer_validator.py: `extract_tokenizer_metadata` / `load_tokenizer_metadata` / `_special_token_id`. Pure-JSON parse (no tokenizers/transformers dep), reusing the existing `_extract_vocab` so vocab_size matches the SDK's upload-time vocab-fit basis.
- file_transfer.py: `_copy_tokenizer_if_present(cfg)` returns the fingerprint; new `get_shipped_tokenizer_metadata(cfg)` reads SRC_PATH/tokenizer.json (Config threaded per the P4 refactor).
- ingestors/base.py: for NLP datasets, attach `file_options["tokenizer"]` before send_global_meta_meta, using the run's resolved Config.
Non-breaking: text/token classification keep an OPTIONAL tokenizer (warn + skip the cross-check when absent: the epic's legacy/skipped path); MLM keeps its mandatory tokenizer. No backend migration: GlobalMetaData.meta_data is a JSONField.
Tests: tests/test_tokenizer_fingerprint.py (extraction + FL guardrail + file_transfer helpers), 3 ingest-flow tests in test_ingestor_base.py, and NLP_CATEGORIES assertions in test_modality_registry.py. Full suite: 1101 pass.
Part of tracebloc/backend#805 (Task 2 of 5).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…an id
_special_token_id returned entry.get("id") as soon as an added_tokens entry
matched the token string — even when that entry carried no "id" — so a
malformed entry shadowed the model.vocab mapping that does hold the id, and
pad_token_id / mask_token_id could be registered as None. Only return the
added_tokens id when present; otherwise fall through to the vocab lookup.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
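The fall-through described in this commit can be sketched as follows. This is an illustration only, not the code in `validators/tokenizer_validator.py`: the function name `special_token_id` is a hypothetical stand-in, and it assumes the HuggingFace-style layout where `added_tokens` entries carry `content` and `id` keys and `model.vocab` is a token-to-id mapping (a Unigram vocab, stored as a list, is not handled here).

```python
from typing import Any, Dict, Optional


def special_token_id(tokenizer: Dict[str, Any], token: str) -> Optional[int]:
    """Hypothetical stand-in for _special_token_id (illustration only)."""
    # Prefer an added_tokens entry, but only when it actually carries an id.
    for entry in tokenizer.get("added_tokens") or []:
        if isinstance(entry, dict) and entry.get("content") == token:
            token_id = entry.get("id")
            # Compare against None, not truthiness: id 0 is a valid id.
            if token_id is not None:
                return token_id
            # Malformed entry without an id: keep looking instead of
            # returning None and shadowing the vocab mapping.

    # Fall through to the model.vocab lookup (assumed to be a dict here).
    vocab = (tokenizer.get("model") or {}).get("vocab")
    if isinstance(vocab, dict):
        return vocab.get(token)
    return None
```

The `is not None` check matters in practice because `[PAD]` commonly sits at id 0, which a plain truthiness test would treat as missing.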
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f4c4432.
get_shipped_tokenizer_metadata read SRC_PATH/tokenizer.json, but _copy_tokenizer_if_present copies SRC->DEST only once (skips when DEST exists). On a re-ingest that ships a changed source the client keeps training on the existing DEST file, so fingerprinting SRC would register metadata for a tokenizer the site never trained on, breaking the contributor cross-check at linking.
Read DEST_PATH/tokenizer.json (the file the client actually loads) so the registered fingerprint always matches what is trained on.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
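A small reproduction of the re-ingest scenario may make the SRC-versus-DEST distinction concrete. The helper names below (`copy_tokenizer_once`, `read_staged_tokenizer`, `demo`) and the directory layout are assumptions made for this sketch; they only imitate the copy-once behaviour the commit message describes and are not the functions in `file_transfer.py`.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional, Tuple


def copy_tokenizer_once(src_dir: Path, dest_dir: Path) -> None:
    """Copy tokenizer.json SRC -> DEST only when DEST does not exist yet."""
    src = src_dir / "tokenizer.json"
    dest = dest_dir / "tokenizer.json"
    if src.is_file() and not dest.exists():
        dest_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dest)


def read_staged_tokenizer(dest_dir: Path) -> Optional[Dict[str, Any]]:
    """Read the staged (DEST) file; missing or malformed yields None."""
    try:
        return json.loads((dest_dir / "tokenizer.json").read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None


def demo() -> Tuple[int, int]:
    """Return (staged vocab size, shipped vocab size) after a re-ingest."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dest = Path(tmp, "src"), Path(tmp, "dest")
        src.mkdir()
        src_file = src / "tokenizer.json"

        # First ingest: the shipped file is staged into DEST.
        src_file.write_text(json.dumps({"model": {"vocab": {"a": 0}}}))
        copy_tokenizer_once(src, dest)

        # Re-ingest with a changed source: the copy is skipped.
        src_file.write_text(json.dumps({"model": {"vocab": {"a": 0, "b": 1}}}))
        copy_tokenizer_once(src, dest)

        staged = read_staged_tokenizer(dest)
        shipped = json.loads(src_file.read_text())
        return len(staged["model"]["vocab"]), len(shipped["model"]["vocab"])
```

In this sketch `demo()` shows the two files diverging after the second ingest, which is why a fingerprint taken from SRC would describe a tokenizer the site is not training on.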
divyasinghds approved these changes on Jun 17, 2026
…tokenizer-ingest-metadata

Task 2 of tracebloc/backend#805 — Ship tokenizer validation metadata at ingest
Part of the epic federated NLP tokenizer alignment (parent tracebloc/backend#748). This is the data-ingestors half of Task 2; the backend half (a storage contract test) is tracebloc/backend#813.
What & why
At ingest, extract a shipped `tokenizer.json`'s 4 structural integers and ship them on the existing global-metadata channel (`send_global_meta_meta`) so the backend can later cross-check a contributor tokenizer at dataset linking (Task 4):
- `vocab_size`: size of the vocabulary (`model.vocab` ∪ `added_tokens`), the bound the model embedding must cover
- `mask_token_id`: id of `[MASK]` (`None` for classification)
- `pad_token_id`: id of `[PAD]`
- `tokenizer_type`: `model.type` (WordLevel / WordPiece / BPE / Unigram)
FL guardrail: only these 4 integers ever cross the cluster boundary, never vocabulary content and never a hash. A custom clinical/knowledge-graph tokenizer's vocabulary is data-derived and must not be centrally fingerprinted.
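As a rough illustration of the extraction and the guardrail above, the following sketch pulls the four fields out of an already-parsed `tokenizer.json` using only plain dict handling. It is not the PR's `extract_tokenizer_metadata`: the names `extract_fingerprint`, `_token_ids`, and `FINGERPRINT_KEYS` are hypothetical, `vocab_size` is assumed here to be the count of distinct tokens in the union (the real `_extract_vocab` may define it differently), and `tokenizer_type` is passed through as the `model.type` value.

```python
from typing import Any, Dict

# The only keys allowed to leave the cluster (hypothetical constant name).
FINGERPRINT_KEYS = ("vocab_size", "mask_token_id", "pad_token_id", "tokenizer_type")


def _token_ids(tokenizer: Dict[str, Any]) -> Dict[str, int]:
    """Build token -> id from model.vocab and added_tokens (local use only)."""
    vocab = (tokenizer.get("model") or {}).get("vocab")
    ids: Dict[str, int] = {}
    if isinstance(vocab, dict):
        # WordLevel / WordPiece / BPE: vocab maps token -> id.
        ids.update(vocab)
    elif isinstance(vocab, list):
        # Unigram: assumed list of [token, score]; the id is the position.
        for index, item in enumerate(vocab):
            ids[item[0]] = index
    for entry in tokenizer.get("added_tokens") or []:
        if entry.get("id") is not None and "content" in entry:
            ids[entry["content"]] = entry["id"]
    return ids


def extract_fingerprint(tokenizer: Dict[str, Any]) -> Dict[str, Any]:
    """Return exactly four scalars; no vocabulary content and no hash."""
    ids = _token_ids(tokenizer)
    return {
        "vocab_size": len(ids),  # assumption: distinct tokens in the union
        "mask_token_id": ids.get("[MASK]"),  # None when there is no mask token
        "pad_token_id": ids.get("[PAD]"),
        "tokenizer_type": (tokenizer.get("model") or {}).get("type"),
    }
```

The token-to-id map stays inside `_token_ids`; only the returned dict, whose keys match `FINGERPRINT_KEYS`, would be attached to the outgoing metadata.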
Changes
- `modalities/spec.py` + `registry.py`: new `is_nlp` `ModalitySpec` flag (mirrors `is_self_supervised`); `NLP_CATEGORIES` derived from it (registry single-source, can't drift). Exported from `modalities/__init__.py`.
- `validators/tokenizer_validator.py`: `extract_tokenizer_metadata` / `load_tokenizer_metadata` / `_special_token_id`. Pure-JSON parse (no `tokenizers`/`transformers` dependency added), reusing the existing `_extract_vocab` so `vocab_size` matches the SDK's upload-time vocab-fit basis.
- `file_transfer.py`: `_copy_tokenizer_if_present(cfg)` returns the fingerprint; new `get_shipped_tokenizer_metadata(cfg)` reads `DEST_PATH/tokenizer.json`, the staged file the client actually loads (originally `SRC_PATH`; changed in a follow-up commit so the fingerprint matches what is trained on). Config threaded per the P4 refactor.
- `ingestors/base.py`: for NLP datasets, attach `file_options["tokenizer"]` before `send_global_meta_meta`, using the run's resolved Config (`self.database.config`).
Behaviour (non-breaking)
- Text/token classification keep an optional tokenizer: the fingerprint is registered when `tokenizer.json` is present; when absent, a warning is logged and the contributor-tokenizer cross-check is simply skipped for that site (the epic's "legacy/skipped" path). MLM keeps its mandatory tokenizer.
- No backend migration: `GlobalMetaData.meta_data` is a JSONField that stores the key as-is.
Tests
- `tests/test_tokenizer_fingerprint.py`: extraction (BERT/WordPiece, no-mask classification, Unigram), FL guardrail (exactly 4 scalar keys), `load_tokenizer_metadata` (missing/malformed → None), and the `file_transfer` helpers.
- `tests/test_ingestor_base.py`: 3 ingest-flow tests (fingerprint registered for each NLP category; absent → warn + still register; non-NLP untouched).
- `tests/test_modality_registry.py`: `NLP_CATEGORIES` derives from `is_nlp` and equals the three text categories.
Full suite: 1101 passed, 1 xfailed.
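The "derived from the flag, can't drift" idea in the registry change can be sketched like this. Everything below is a simplified assumption: the real `ModalitySpec` has more fields, and the category names and the `categories_where` helper are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ModalitySpec:
    """Reduced, hypothetical version of the spec: only the two flags."""
    name: str
    is_self_supervised: bool = False
    is_nlp: bool = False


# Hypothetical registry contents; real category names may differ.
REGISTRY: Dict[str, ModalitySpec] = {
    spec.name: spec
    for spec in (
        ModalitySpec("image_classification"),
        ModalitySpec("text_classification", is_nlp=True),
        ModalitySpec("token_classification", is_nlp=True),
        ModalitySpec("masked_language_modeling", is_self_supervised=True, is_nlp=True),
    )
}


def categories_where(flag: str) -> FrozenSet[str]:
    """Collect the categories whose spec has the given boolean flag set."""
    return frozenset(name for name, spec in REGISTRY.items() if getattr(spec, flag))


# Derived, never hand-maintained: adding a spec with is_nlp=True updates it.
NLP_CATEGORIES = categories_where("is_nlp")
```

Because the set is computed from the registry, a registry test only needs to assert that it equals the expected text categories; there is no second list to keep in sync.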
🤖 Generated with Claude Code
Note
Medium Risk
Touches dataset registration metadata and federated privacy boundaries (only scalars should leave the cluster); behavior is additive and optional for classification, but wrong DEST fingerprinting could misalign backend cross-checks at linking.
Overview
Adds #805 Task 2: NLP ingests can ship a 4-field tokenizer fingerprint on the existing global-metadata path (`send_global_meta_meta` via `file_options["tokenizer"]`), so the backend can cross-check contributor tokenizers at dataset linking without sending vocab text or hashes.
- Modality registry gains `is_nlp` on the three text categories and a derived `NLP_CATEGORIES` set (exported like the other flag-derived frozensets).
- `tokenizer_validator` adds `extract_tokenizer_metadata`, `load_tokenizer_metadata`, and `_special_token_id` to pull `vocab_size`, `mask_token_id`, `pad_token_id`, and `tokenizer_type` from HuggingFace-style `tokenizer.json`.
- `file_transfer` updates `_copy_tokenizer_if_present` to return the fingerprint on first copy; `get_shipped_tokenizer_metadata` reads `DEST_PATH/tokenizer.json` (staged file used for training), including re-ingest cases where SRC changed but DEST was not recopied.
- `BaseIngestor` for NLP categories attaches the fingerprint before global metadata when present; missing tokenizer logs a warning and registration still completes (optional path for text/token classification).
Tests cover extraction variants, DEST-vs-SRC fingerprinting, ingest wiring for NLP vs non-NLP, and registry `NLP_CATEGORIES` invariants.
Reviewed by Cursor Bugbot for commit 35aab7e.
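The ingest wiring summarised above (attach when present, warn and still register when absent, leave non-NLP untouched) might look roughly like the sketch below. The function `register_dataset`, its parameters, the logger name, and the category names are all hypothetical; the real logic lives in `BaseIngestor` and calls `send_global_meta_meta`, which is represented here by an injected `send` callable.

```python
import logging
from typing import Any, Callable, Dict, FrozenSet, Optional

logger = logging.getLogger("ingest_sketch")

# Hypothetical category names standing in for the registry-derived set.
NLP_CATEGORIES: FrozenSet[str] = frozenset(
    {"text_classification", "token_classification", "masked_language_modeling"}
)


def register_dataset(
    category: str,
    file_options: Dict[str, Any],
    fingerprint: Optional[Dict[str, Any]],
    send: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Attach the tokenizer fingerprint for NLP categories, then register."""
    options = dict(file_options)  # copy so the caller's dict is not mutated
    if category in NLP_CATEGORIES:
        if fingerprint is None:
            # Optional path: registration proceeds, the cross-check is skipped.
            logger.warning(
                "no tokenizer.json staged for %s; tokenizer cross-check skipped",
                category,
            )
        else:
            options["tokenizer"] = fingerprint
    send(options)  # stand-in for the global-metadata call
    return options
```

Passing `send` in as a callable keeps the sketch testable without a backend; in a test it can simply be `list.append` on a capture list.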