feat(#805): register NLP tokenizer fingerprint at ingest (Task 2)#281

Merged
shujaatTracebloc merged 4 commits into develop from feature/805-task2-tokenizer-ingest-metadata
Jun 17, 2026

@shujaatTracebloc shujaatTracebloc commented Jun 16, 2026

Contributor

Task 2 of tracebloc/backend#805 — Ship tokenizer validation metadata at ingest

Part of the epic federated NLP tokenizer alignment (parent tracebloc/backend#748). This is the data-ingestors half of Task 2; the backend half (a storage contract test) is tracebloc/backend#813.

What & why

At ingest, extract 4 structural scalars from a shipped tokenizer.json (three integers plus the model type string) and ship them on the existing global-metadata channel (send_global_meta_meta) so the backend can later cross-check a contributor tokenizer at dataset linking (Task 4):

| field | meaning |
| --- | --- |
| `vocab_size` | distinct tokens (`model.vocab` ∪ `added_tokens`); the bound the model embedding must cover |
| `mask_token_id` | id of `[MASK]` (None for classification) |
| `pad_token_id` | id of `[PAD]` |
| `tokenizer_type` | `model.type` (WordLevel / WordPiece / BPE / Unigram) |

FL guardrail: only these 4 scalars ever cross the cluster boundary, never vocabulary content and never a hash. A custom clinical/knowledge-graph tokenizer's vocabulary is data-derived and must not be centrally fingerprinted.
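As a rough illustration of the pure-JSON extraction described above, the sketch below takes a parsed HuggingFace-style tokenizer.json and returns only the four fields. The names `sketch_fingerprint` and `_token_ids` are hypothetical; this is not the code in validators/tokenizer_validator.py. It assumes dict-style vocabularies map token to id, and that Unigram vocabularies are a list of [token, score] pairs whose index is the id.

```python
import json


def _token_ids(tokenizer: dict) -> dict:
    """Map token string -> id across model.vocab and added_tokens (assumed layout)."""
    model = tokenizer.get("model") or {}
    vocab = model.get("vocab") or {}
    ids = {}
    if isinstance(vocab, dict):
        # WordLevel / WordPiece / BPE: {"token": id}
        ids.update(vocab)
    elif isinstance(vocab, list):
        # Unigram: [["token", score], ...]; the list index is the id
        for index, entry in enumerate(vocab):
            if isinstance(entry, (list, tuple)) and entry:
                ids[entry[0]] = index
    for entry in tokenizer.get("added_tokens") or []:
        content, token_id = entry.get("content"), entry.get("id")
        if isinstance(content, str) and isinstance(token_id, int):
            ids[content] = token_id
    return ids


def sketch_fingerprint(tokenizer: dict) -> dict:
    """Return only the four scalars; no vocabulary content and no hash."""
    ids = _token_ids(tokenizer)
    return {
        "vocab_size": len(ids),              # distinct tokens
        "mask_token_id": ids.get("[MASK]"),  # None when there is no [MASK]
        "pad_token_id": ids.get("[PAD]"),
        "tokenizer_type": (tokenizer.get("model") or {}).get("type"),
    }


if __name__ == "__main__":
    raw = '{"model": {"type": "WordPiece", "vocab": {"[PAD]": 0, "hello": 1}}}'
    print(sketch_fingerprint(json.loads(raw)))
```

The returned dict is small enough to sit under a single metadata key, which is the point of the guardrail: nothing in it can be used to reconstruct the vocabulary.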

Changes

  • modalities/spec.py + registry.py: new is_nlp ModalitySpec flag (mirrors is_self_supervised); NLP_CATEGORIES is derived from it (single source in the registry, so it can't drift). Exported from modalities/__init__.py.
  • validators/tokenizer_validator.py: extract_tokenizer_metadata / load_tokenizer_metadata / _special_token_id. Pure-JSON parse (no tokenizers/transformers dependency added), reusing the existing _extract_vocab so vocab_size matches the SDK's upload-time vocab-fit basis.
  • file_transfer.py: _copy_tokenizer_if_present(cfg) returns the fingerprint; new get_shipped_tokenizer_metadata(cfg) reads SRC_PATH/tokenizer.json (Config threaded per the P4 refactor).
  • ingestors/base.py: for NLP datasets, attach file_options["tokenizer"] before send_global_meta_meta, using the run's resolved Config (self.database.config).
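The flag-derived set in the first bullet could look roughly like the sketch below. The `ModalitySpec` fields other than `is_nlp` and `is_self_supervised`, the `REGISTRY` name, and all category strings are assumptions made for illustration, not the repository's actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalitySpec:
    category: str                      # hypothetical field name
    is_self_supervised: bool = False
    is_nlp: bool = False


# Hypothetical registry contents; the real category names may differ.
REGISTRY = {
    spec.category: spec
    for spec in (
        ModalitySpec("image_classification"),
        ModalitySpec("text_classification", is_nlp=True),
        ModalitySpec("token_classification", is_nlp=True),
        ModalitySpec("masked_language_modeling", is_self_supervised=True, is_nlp=True),
    )
}

# Derived from the flag, so adding a category with is_nlp=True updates the
# set automatically and the two can never disagree.
NLP_CATEGORIES = frozenset(
    name for name, spec in REGISTRY.items() if spec.is_nlp
)

if __name__ == "__main__":
    print(sorted(NLP_CATEGORIES))
```

Deriving the set rather than listing it by hand is what makes the registry the single source of truth.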

Behaviour (non-breaking)

  • MLM — tokenizer remains mandatory (unchanged).
  • Text / token classification — tokenizer stays optional: the fingerprint is registered when a tokenizer.json is present; when absent, a warning is logged and the contributor-tokenizer cross-check is simply skipped for that site (the epic's "legacy/skipped" path).
  • No backend migration — GlobalMetaData.meta_data is a JSONField that stores the key as-is.
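The optional path above can be sketched as a small pure function. This is a simplified model of the behaviour, not the BaseIngestor code: `attach_tokenizer_fingerprint`, the logger name, and the category strings are hypothetical, and the mandatory-tokenizer check for MLM is assumed to happen elsewhere.

```python
import logging
from typing import Optional

logger = logging.getLogger("ingest_sketch")  # hypothetical logger name

# Hypothetical category names standing in for the registry-derived set.
NLP_CATEGORIES = frozenset(
    {"text_classification", "token_classification", "masked_language_modeling"}
)


def attach_tokenizer_fingerprint(
    category: str, file_options: dict, fingerprint: Optional[dict]
) -> dict:
    """Return file_options, adding "tokenizer" for NLP datasets when available."""
    if category not in NLP_CATEGORIES:
        return file_options  # non-NLP datasets are left untouched
    if fingerprint is None:
        # Registration still proceeds; only the later cross-check is skipped.
        logger.warning(
            "No tokenizer.json for %s; contributor tokenizer cross-check skipped",
            category,
        )
        return file_options
    return {**file_options, "tokenizer": fingerprint}
```

Because the key is simply absent when no tokenizer was shipped, the backend can tell the "legacy/skipped" case apart from a registered fingerprint without any schema change.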

Tests

  • tests/test_tokenizer_fingerprint.py: extraction (BERT/WordPiece, no-mask classification, Unigram), FL guardrail (exactly 4 scalar keys), load_tokenizer_metadata (missing/malformed → None), and the file_transfer helpers.
  • tests/test_ingestor_base.py: 3 ingest-flow tests (fingerprint registered for each NLP category; absent → warn + still register; non-NLP untouched).
  • tests/test_modality_registry.py: NLP_CATEGORIES derives from is_nlp and equals the three text categories.
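The "exactly 4 scalar keys" guardrail lends itself to a direct check. The helper below is one possible shape for such an assertion; `guardrail_violations` is a hypothetical name and is not taken from tests/test_tokenizer_fingerprint.py.

```python
ALLOWED_KEYS = frozenset(
    {"vocab_size", "mask_token_id", "pad_token_id", "tokenizer_type"}
)


def guardrail_violations(fingerprint: dict) -> list:
    """List every way a fingerprint breaks the four-scalar rule (empty if none)."""
    problems = []
    extra = set(fingerprint) - ALLOWED_KEYS
    missing = ALLOWED_KEYS - set(fingerprint)
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    for key, value in fingerprint.items():
        if value is None:
            continue  # e.g. mask_token_id for classification
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, str)):
            problems.append(f"{key} is not a scalar: {type(value).__name__}")
    return problems
```

A test can then assert that the list is empty for every extracted fingerprint, which fails loudly if a future change starts shipping vocabulary content or a digest.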

Full suite: 1101 passed, 1 xfailed.

🤖 Generated with Claude Code


Note

Medium Risk
Touches dataset registration metadata and federated privacy boundaries (only scalars should leave the cluster); behavior is additive and optional for classification, but wrong DEST fingerprinting could misalign backend cross-checks at linking.

Overview
Adds #805 Task 2: NLP ingests can ship a 4-field tokenizer fingerprint on the existing global-metadata path (send_global_meta_meta via file_options["tokenizer"]), so the backend can cross-check contributor tokenizers at dataset linking without sending vocab text or hashes.

Modality registry gains is_nlp on the three text categories and a derived NLP_CATEGORIES set (exported like the other flag-derived frozensets).

tokenizer_validator adds extract_tokenizer_metadata, load_tokenizer_metadata, and _special_token_id to pull vocab_size, mask_token_id, pad_token_id, and tokenizer_type from HuggingFace-style tokenizer.json.

file_transfer updates _copy_tokenizer_if_present to return the fingerprint on first copy; get_shipped_tokenizer_metadata reads DEST_PATH/tokenizer.json (staged file used for training), including re-ingest cases where SRC changed but DEST was not recopied.

BaseIngestor for NLP categories attaches the fingerprint before global metadata when present; missing tokenizer logs a warning and registration still completes (optional path for text/token classification).

Tests cover extraction variants, DEST-vs-SRC fingerprinting, ingest wiring for NLP vs non-NLP, and registry NLP_CATEGORIES invariants.

Reviewed by Cursor Bugbot for commit 35aab7e.

Extract 4 structural scalars from a shipped tokenizer.json (vocab_size,
mask_token_id, pad_token_id, tokenizer_type) at ingest and ship them on the
existing global-metadata channel (send_global_meta_meta), so the backend can
cross-check a contributor tokenizer at dataset linking (Task 4). FL guardrail:
ONLY these 4 scalars cross the cluster boundary, never vocabulary content
and never a hash (a custom clinical/KG tokenizer's vocab is data-derived and
must not be centrally fingerprinted).

- modalities/spec.py + registry.py: new `is_nlp` ModalitySpec flag (mirrors
  `is_self_supervised`); derive `NLP_CATEGORIES` from it (single source — can't
  drift). Exported from modalities/__init__.py.
- validators/tokenizer_validator.py: `extract_tokenizer_metadata` /
  `load_tokenizer_metadata` / `_special_token_id` — pure-JSON parse (no
  tokenizers/transformers dep), reusing the existing `_extract_vocab` so
  vocab_size matches the SDK's upload-time vocab-fit basis.
- file_transfer.py: `_copy_tokenizer_if_present(cfg)` returns the fingerprint;
  new `get_shipped_tokenizer_metadata(cfg)` reads SRC_PATH/tokenizer.json
  (Config threaded per the P4 refactor).
- ingestors/base.py: for NLP datasets, attach `file_options["tokenizer"]`
  before send_global_meta_meta, using the run's resolved Config.

Non-breaking: text/token classification keep an OPTIONAL tokenizer (warn +
skip the cross-check when absent — the epic's legacy/skipped path); MLM keeps
its mandatory tokenizer. No backend migration — GlobalMetaData.meta_data is a
JSONField.

Tests: tests/test_tokenizer_fingerprint.py (extraction + FL guardrail +
file_transfer helpers), 3 ingest-flow tests in test_ingestor_base.py, and
NLP_CATEGORIES assertions in test_modality_registry.py. Full suite: 1101 pass.

Part of tracebloc/backend#805 (Task 2 of 5).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shujaatTracebloc shujaatTracebloc self-assigned this Jun 16, 2026
@LukasWodka

Collaborator

👋 Heads-up — Code review queue is at 20 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Comment thread tracebloc_ingestor/validators/tokenizer_validator.py Outdated
…an id

_special_token_id returned entry.get("id") as soon as an added_tokens entry
matched the token string — even when that entry carried no "id" — so a
malformed entry shadowed the model.vocab mapping that does hold the id, and
pad_token_id / mask_token_id could be registered as None. Only return the
added_tokens id when present; otherwise fall through to the vocab lookup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
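The fall-through described in this commit can be sketched as follows. `special_token_id_sketch` is a hypothetical stand-in for `_special_token_id`, and it assumes a dict-style `model.vocab` (token to id); the real helper may handle more layouts.

```python
from typing import Optional


def special_token_id_sketch(tokenizer: dict, token: str) -> Optional[int]:
    """Look up a special token id, preferring added_tokens but never trusting
    an entry that lacks an integer "id"."""
    for entry in tokenizer.get("added_tokens") or []:
        if entry.get("content") == token:
            token_id = entry.get("id")
            if isinstance(token_id, int):
                return token_id
            # Malformed entry: keep looking instead of returning None here,
            # which is the shadowing bug the commit describes.
    vocab = (tokenizer.get("model") or {}).get("vocab") or {}
    if isinstance(vocab, dict):
        token_id = vocab.get(token)
        if isinstance(token_id, int):
            return token_id
    return None
```

With the earlier behaviour, the first matching entry would have ended the search even without an id, so a tokenizer whose vocab did define `[PAD]` could still be registered with `pad_token_id` set to None.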

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f4c4432.

Comment thread tracebloc_ingestor/file_transfer.py Outdated
get_shipped_tokenizer_metadata read SRC_PATH/tokenizer.json, but
_copy_tokenizer_if_present copies SRC->DEST only once (skips when DEST exists).
On a re-ingest that ships a changed source the client keeps training on the
existing DEST file, so fingerprinting SRC would register metadata for a
tokenizer the site never trained on — breaking the contributor cross-check at
linking. Read DEST_PATH/tokenizer.json (the file the client actually loads) so
the registered fingerprint always matches what is trained on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
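The copy-once behaviour and the reason for reading DEST can be reproduced with two small functions. Both names are hypothetical, and the directory arguments stand in for the SRC_PATH and DEST_PATH values carried by the real Config.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def copy_tokenizer_if_absent(src_dir: Path, dest_dir: Path) -> None:
    """Copy tokenizer.json SRC -> DEST only when DEST does not exist yet."""
    src = src_dir / "tokenizer.json"
    dest = dest_dir / "tokenizer.json"
    if src.is_file() and not dest.exists():
        shutil.copyfile(src, dest)


def read_staged_tokenizer(dest_dir: Path) -> Optional[dict]:
    """Parse the staged (DEST) tokenizer.json, i.e. the file training loads."""
    try:
        return json.loads((dest_dir / "tokenizer.json").read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None  # missing or malformed: no fingerprint to register
```

On a re-ingest where the source file has changed, the second copy is a no-op, so anything read from SRC would describe a tokenizer the site is not actually using; reading DEST keeps the registered metadata and the trained-on file in step.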
@shujaatTracebloc shujaatTracebloc merged commit 751a98a into develop Jun 17, 2026
4 checks passed
@shujaatTracebloc shujaatTracebloc deleted the feature/805-task2-tokenizer-ingest-metadata branch June 17, 2026 08:02

3 participants