feat(#805): register NLP tokenizer fingerprint at ingest (Task 2) by shujaatTracebloc · Pull Request #281 · tracebloc/data-ingestors

shujaatTracebloc · 2026-06-16T14:47:37Z

Task 2 of tracebloc/backend#805 — Ship tokenizer validation metadata at ingest

Part of the epic federated NLP tokenizer alignment (parent tracebloc/backend#748). This is the data-ingestors half of Task 2; the backend half (a storage contract test) is tracebloc/backend#813.

What & why

At ingest, extract 4 structural scalar fields from a shipped tokenizer.json and ship them on the existing global-metadata channel (send_global_meta_meta) so the backend can later cross-check a contributor tokenizer at dataset linking (Task 4):

| field | meaning |
| --- | --- |
| `vocab_size` | distinct tokens (`model.vocab` ∪ `added_tokens`); the bound the model embedding must cover |
| `mask_token_id` | id of `[MASK]` (`None` for classification) |
| `pad_token_id` | id of `[PAD]` |
| `tokenizer_type` | `model.type` (WordLevel / WordPiece / BPE / Unigram) |

FL guardrail: only these 4 scalar fields (three integer values plus the tokenizer type string) ever cross the cluster boundary, never vocabulary content and never a hash. A custom clinical/knowledge-graph tokenizer's vocabulary is data-derived and must not be centrally fingerprinted.
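A minimal sketch of how such a fingerprint could be pulled from a HuggingFace-style tokenizer.json with plain JSON access. The function names `lookup_special_token_id` and `fingerprint_from_tokenizer_json` are hypothetical and are not the PR's `extract_tokenizer_metadata` or `_extract_vocab`. The sketch assumes the usual layout in which `added_tokens` entries carry `content` and `id`, and `model.vocab` is either a token-to-id mapping or, for Unigram, a list of `[token, score]` pairs.

```python
from typing import Any, Dict, Optional


def lookup_special_token_id(tokenizer: Dict[str, Any], token: str) -> Optional[int]:
    """Return the id of a special token, or None when it cannot be resolved."""
    for entry in tokenizer.get("added_tokens") or []:
        if entry.get("content") == token and entry.get("id") is not None:
            return entry["id"]
    vocab = (tokenizer.get("model") or {}).get("vocab")
    if isinstance(vocab, dict):
        return vocab.get(token)
    # Assumption: list-style (Unigram) vocabularies are not searched in this sketch.
    return None


def fingerprint_from_tokenizer_json(tokenizer: Dict[str, Any]) -> Dict[str, Any]:
    """Build the 4-field fingerprint; no vocabulary text or hash is returned."""
    model = tokenizer.get("model") or {}
    vocab = model.get("vocab") or {}
    if isinstance(vocab, dict):
        tokens = set(vocab)                      # {token: id} mapping
    else:
        tokens = {piece[0] for piece in vocab}   # [[token, score], ...] list
    tokens |= {
        entry["content"]
        for entry in tokenizer.get("added_tokens") or []
        if "content" in entry
    }
    return {
        "vocab_size": len(tokens),
        "mask_token_id": lookup_special_token_id(tokenizer, "[MASK]"),
        "pad_token_id": lookup_special_token_id(tokenizer, "[PAD]"),
        "tokenizer_type": model.get("type"),
    }
```

Only counts, ids, and the type name leave the function, which is the property the guardrail asks for.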

Changes

modalities/spec.py + registry.py: new is_nlp ModalitySpec flag (mirrors is_self_supervised); NLP_CATEGORIES is derived from it (single source in the registry, so it can't drift). Exported from modalities/__init__.py.
validators/tokenizer_validator.py: extract_tokenizer_metadata / load_tokenizer_metadata / _special_token_id. Pure-JSON parse (no tokenizers/transformers dependency added), reusing the existing _extract_vocab so vocab_size matches the SDK's upload-time vocab-fit basis.
file_transfer.py: _copy_tokenizer_if_present(cfg) returns the fingerprint; new get_shipped_tokenizer_metadata(cfg) reads DEST_PATH/tokenizer.json, the staged file the client actually trains on (the first commit read SRC_PATH; this was changed during review). Config is threaded per the P4 refactor.
ingestors/base.py: for NLP datasets, attach file_options["tokenizer"] before send_global_meta_meta, using the run's resolved Config (self.database.config).
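The flag-derived set described for the registry could look roughly like the sketch below. The `ModalitySpec` fields other than `is_nlp` and `is_self_supervised`, the `REGISTRY` name, and the category strings are assumptions for illustration; the real registry may differ.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ModalitySpec:
    name: str
    is_self_supervised: bool = False
    is_nlp: bool = False


# Hypothetical entries; the category names are placeholders.
REGISTRY: Dict[str, ModalitySpec] = {
    "text_classification": ModalitySpec("text_classification", is_nlp=True),
    "token_classification": ModalitySpec("token_classification", is_nlp=True),
    "masked_language_modeling": ModalitySpec(
        "masked_language_modeling", is_self_supervised=True, is_nlp=True
    ),
    "image_classification": ModalitySpec("image_classification"),
}

# Derived from the flag, so adding a new NLP category needs no second edit.
NLP_CATEGORIES: FrozenSet[str] = frozenset(
    name for name, spec in REGISTRY.items() if spec.is_nlp
)
```

Deriving the set from the spec flag, rather than listing the categories a second time, is what keeps the two from drifting apart.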

Behaviour (non-breaking)

MLM — tokenizer remains mandatory (unchanged).
Text / token classification — tokenizer stays optional: the fingerprint is registered when a tokenizer.json is present; when absent, a warning is logged and the contributor-tokenizer cross-check is simply skipped for that site (the epic's "legacy/skipped" path).
No backend migration — GlobalMetaData.meta_data is a JSONField that stores the key as-is.
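The behaviour above can be sketched as a small helper that decides what goes into `file_options` before the global metadata is sent. The helper name `attach_tokenizer_fingerprint`, the logger name, and the category strings are hypothetical; only the `file_options["tokenizer"]` key comes from this PR.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger("ingest_sketch")

# Placeholder category names, assumed for illustration only.
NLP_CATEGORIES = frozenset(
    {"text_classification", "token_classification", "masked_language_modeling"}
)


def attach_tokenizer_fingerprint(
    category: str,
    file_options: Dict[str, Any],
    fingerprint: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Return file_options, with the fingerprint added for NLP datasets."""
    if category not in NLP_CATEGORIES:
        return file_options  # non-NLP datasets are left untouched
    if fingerprint is None:
        # Optional path: registration continues, the cross-check is skipped.
        logger.warning("no tokenizer.json staged for %s; skipping fingerprint", category)
        return file_options
    return {**file_options, "tokenizer": fingerprint}
```

Returning a new dict instead of mutating the argument is a choice made for this sketch, not something the PR states.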

Tests

tests/test_tokenizer_fingerprint.py — extraction (BERT/WordPiece, no-mask classification, Unigram), FL guardrail (exactly 4 scalar keys), load_tokenizer_metadata (missing/malformed → None), and the file_transfer helpers.
tests/test_ingestor_base.py — 3 ingest-flow tests (fingerprint registered for each NLP category; absent → warn + still register; non-NLP untouched).
tests/test_modality_registry.py — NLP_CATEGORIES derives from is_nlp and equals the three text categories.

Full suite: 1101 passed, 1 xfailed.
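The FL guardrail check named above ("exactly 4 scalar keys") might be expressed along these lines. This is an illustrative predicate, not the test from tests/test_tokenizer_fingerprint.py; `ALLOWED_KEYS` and `is_guardrail_compliant` are hypothetical names.

```python
from typing import Any, Mapping

ALLOWED_KEYS = frozenset(
    {"vocab_size", "mask_token_id", "pad_token_id", "tokenizer_type"}
)


def is_guardrail_compliant(fingerprint: Mapping[str, Any]) -> bool:
    """True when the payload has exactly the 4 keys and only scalar values."""
    if set(fingerprint) != ALLOWED_KEYS:
        return False
    for value in fingerprint.values():
        if value is None:
            continue
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, str)):
            return False
    return True
```

A check like this rejects any payload that smuggles in a vocabulary list, a nested mapping, or an extra key such as a hash.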

🤖 Generated with Claude Code

Note

Medium Risk
Touches dataset registration metadata and federated privacy boundaries (only scalars should leave the cluster); behavior is additive and optional for classification, but wrong DEST fingerprinting could misalign backend cross-checks at linking.

Overview
Adds #805 Task 2: NLP ingests can ship a 4-field tokenizer fingerprint on the existing global-metadata path (send_global_meta_meta via file_options["tokenizer"]), so the backend can cross-check contributor tokenizers at dataset linking without sending vocab text or hashes.

Modality registry gains is_nlp on the three text categories and a derived NLP_CATEGORIES set (exported like the other flag-derived frozensets).

tokenizer_validator adds extract_tokenizer_metadata, load_tokenizer_metadata, and _special_token_id to pull vocab_size, mask_token_id, pad_token_id, and tokenizer_type from HuggingFace-style tokenizer.json.

file_transfer updates _copy_tokenizer_if_present to return the fingerprint on first copy; get_shipped_tokenizer_metadata reads DEST_PATH/tokenizer.json (staged file used for training), including re-ingest cases where SRC changed but DEST was not recopied.

BaseIngestor for NLP categories attaches the fingerprint before global metadata when present; missing tokenizer logs a warning and registration still completes (optional path for text/token classification).

Tests cover extraction variants, DEST-vs-SRC fingerprinting, ingest wiring for NLP vs non-NLP, and registry NLP_CATEGORIES invariants.

Reviewed by Cursor Bugbot for commit 35aab7e. Bugbot is set up for automated code reviews on this repo.

Extract 4 structural scalar fields from a shipped tokenizer.json (vocab_size, mask_token_id, pad_token_id, tokenizer_type) at ingest and ship them on the existing global-metadata channel (send_global_meta_meta), so the backend can cross-check a contributor tokenizer at dataset linking (Task 4).

FL guardrail: ONLY these 4 fields cross the cluster boundary, never vocabulary content and never a hash (a custom clinical/KG tokenizer's vocab is data-derived and must not be centrally fingerprinted).

- modalities/spec.py + registry.py: new `is_nlp` ModalitySpec flag (mirrors `is_self_supervised`); derive `NLP_CATEGORIES` from it (single source, can't drift). Exported from modalities/__init__.py.
- validators/tokenizer_validator.py: `extract_tokenizer_metadata` / `load_tokenizer_metadata` / `_special_token_id`. Pure-JSON parse (no tokenizers/transformers dep), reusing the existing `_extract_vocab` so vocab_size matches the SDK's upload-time vocab-fit basis.
- file_transfer.py: `_copy_tokenizer_if_present(cfg)` returns the fingerprint; new `get_shipped_tokenizer_metadata(cfg)` reads SRC_PATH/tokenizer.json (Config threaded per the P4 refactor).
- ingestors/base.py: for NLP datasets, attach `file_options["tokenizer"]` before send_global_meta_meta, using the run's resolved Config.

Non-breaking: text/token classification keep an OPTIONAL tokenizer (warn + skip the cross-check when absent, the epic's legacy/skipped path); MLM keeps its mandatory tokenizer. No backend migration: GlobalMetaData.meta_data is a JSONField.

Tests: tests/test_tokenizer_fingerprint.py (extraction + FL guardrail + file_transfer helpers), 3 ingest-flow tests in test_ingestor_base.py, and NLP_CATEGORIES assertions in test_modality_registry.py. Full suite: 1101 pass.

Part of tracebloc/backend#805 (Task 2 of 5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LukasWodka · 2026-06-16T14:48:49Z

👋 Heads-up — Code review queue is at 20 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

.github#57 — fix(fr-gate): pass items at or beyond the required stage · author: @aptracebloc · no reviewer assigned
backend#812 — fix(datasets): subset label check for test-dataset compatibility (#811) · author: @saadqbal · no reviewer assigned
backend#813 — test(#805): contract test for tokenizer fingerprint storage (Task 2) · author: @shujaatTracebloc · no reviewer assigned
cli#78 — fix(dataset rm): delete staging files from a uid-65532 pod, not jobs-manager (Release v0.3.10 (ingestion hardening + path-traversal fix + single-label preflight) #259) · author: @LukasWodka · no reviewer assigned
cli#79 — chore(schema): re-sync vendored ingest.v1.json from data-ingestors master · author: @LukasWodka · no reviewer assigned
client#261 — feat(installer): fail fast when HOST_DATA_DIR is on a network filesystem · author: @saadqbal · no reviewer assigned
client#262 — feat(installer,chart): place datasets on a network mount while MySQL stays local · author: @saadqbal · no reviewer assigned
client-runtime#108 — fix(authz): match ingest table prefixes at a segment boundary (close cross-tenant straddle) · author: @LukasWodka · no reviewer assigned
client-runtime#114 — fix(jobs): cap training Job backoffLimit to stop crashloops starving the cluster · author: @saadqbal · no reviewer assigned
client-runtime#115 — feat(ingestion): run the ingestion pod as the host uid for datasets on NFS · author: @saadqbal · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…an id

_special_token_id returned entry.get("id") as soon as an added_tokens entry matched the token string, even when that entry carried no "id", so a malformed entry shadowed the model.vocab mapping that does hold the id, and pad_token_id / mask_token_id could be registered as None. Only return the added_tokens id when present; otherwise fall through to the vocab lookup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
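The fix described in this commit can be illustrated with a before/after pair. Both functions are reconstructions from the commit message, not the actual code of `_special_token_id`; the `content` key for the token string follows the usual tokenizer.json layout and is an assumption here.

```python
from typing import Any, Dict, Optional


def special_token_id_before(tokenizer: Dict[str, Any], token: str) -> Optional[int]:
    """Reconstructed buggy behaviour: the first matching entry wins, id or not."""
    for entry in tokenizer.get("added_tokens") or []:
        if entry.get("content") == token:
            return entry.get("id")  # None when the entry has no "id"
    return ((tokenizer.get("model") or {}).get("vocab") or {}).get(token)


def special_token_id_after(tokenizer: Dict[str, Any], token: str) -> Optional[int]:
    """Reconstructed fix: use the added_tokens id only when it is present."""
    for entry in tokenizer.get("added_tokens") or []:
        if entry.get("content") == token and entry.get("id") is not None:
            return entry["id"]
    # Fall through to model.vocab, which may still hold the id.
    return ((tokenizer.get("model") or {}).get("vocab") or {}).get(token)


# A malformed added_tokens entry with no "id" next to a valid vocab mapping.
MALFORMED = {
    "added_tokens": [{"content": "[PAD]"}],
    "model": {"type": "WordPiece", "vocab": {"[PAD]": 0, "[MASK]": 4}},
}
```

With `MALFORMED`, the first version resolves `[PAD]` to `None` while the second finds id 0 in `model.vocab`. The explicit `is not None` test matters because 0 is a valid token id and would fail a plain truthiness check.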

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f4c4432.

get_shipped_tokenizer_metadata read SRC_PATH/tokenizer.json, but _copy_tokenizer_if_present copies SRC->DEST only once (skips when DEST exists). On a re-ingest that ships a changed source the client keeps training on the existing DEST file, so fingerprinting SRC would register metadata for a tokenizer the site never trained on, breaking the contributor cross-check at linking. Read DEST_PATH/tokenizer.json (the file the client actually loads) so the registered fingerprint always matches what is trained on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
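The re-ingest scenario in this commit message can be reproduced with a small, self-contained sketch. `copy_tokenizer_once`, `shipped_tokenizer_type`, and `demo_reingest` are hypothetical stand-ins for the copy-once and read-from-DEST behaviour described above, operating on plain directories rather than the project's Config.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Optional

TOKENIZER_FILE = "tokenizer.json"


def copy_tokenizer_once(src_dir: Path, dest_dir: Path) -> bool:
    """Copy SRC -> DEST only when DEST has no tokenizer yet; True if copied."""
    src, dest = src_dir / TOKENIZER_FILE, dest_dir / TOKENIZER_FILE
    if dest.exists() or not src.is_file():
        return False
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return True


def shipped_tokenizer_type(dest_dir: Path) -> Optional[str]:
    """Read model.type from the staged DEST file; None if missing or malformed."""
    try:
        data = json.loads((dest_dir / TOKENIZER_FILE).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None
    model = data.get("model") if isinstance(data, dict) else None
    return model.get("type") if isinstance(model, dict) else None


def demo_reingest() -> Optional[str]:
    """First ingest stages WordPiece; a changed SRC is not recopied on re-ingest."""
    with tempfile.TemporaryDirectory() as tmp:
        src, dest = Path(tmp) / "src", Path(tmp) / "dest"
        src.mkdir()
        (src / TOKENIZER_FILE).write_text(
            json.dumps({"model": {"type": "WordPiece"}}), encoding="utf-8"
        )
        copy_tokenizer_once(src, dest)  # first ingest: file is staged
        (src / TOKENIZER_FILE).write_text(
            json.dumps({"model": {"type": "BPE"}}), encoding="utf-8"
        )
        copy_tokenizer_once(src, dest)  # re-ingest: DEST exists, nothing copied
        return shipped_tokenizer_type(dest)
```

Here `demo_reingest()` reports the type of the staged file, so the registered value stays aligned with what the client trains on even though SRC now holds a different tokenizer.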

shujaatTracebloc self-assigned this Jun 16, 2026

cursor Bot reviewed Jun 16, 2026

Comment thread tracebloc_ingestor/validators/tokenizer_validator.py Outdated

cursor Bot reviewed Jun 16, 2026

Comment thread tracebloc_ingestor/file_transfer.py Outdated

divyasinghds approved these changes Jun 17, 2026

Merge remote-tracking branch 'origin/develop' into feature/805-task2-tokenizer-ingest-metadata (35aab7e)

shujaatTracebloc merged commit 751a98a into develop Jun 17, 2026
4 checks passed

shujaatTracebloc deleted the feature/805-task2-tokenizer-ingest-metadata branch June 17, 2026 08:02

shujaatTracebloc mentioned this pull request Jun 17, 2026

chore(release): v0.3.11 #287

Open

shujaatTracebloc merged 4 commits into develop from feature/805-task2-tokenizer-ingest-metadata

3 participants
