fix(label): strip whitespace from label values to prevent silent class duplication (issue #261)#262
Merged
Merged
Conversation
…s duplication (issue #261) A new-user CSV with ``" A "`` (whitespace-padded) mixed with ``"A"`` landed in MySQL as TWO distinct classes — the ingestor reported ``🎉 Ingestion completed successfully!``, the LabelDiversityValidator happily passed (2 distinct values ≥ 2), and a model trained on the output learned 3 classes instead of the 2 the user intended. Silent label-set corruption. The framework already strips column HEADERS (``chunk.columns.str.strip()`` in csv_ingestor) and the ``data_id`` column (line above in process_record). Extending the same convention to label-column VALUES is internally consistent — a label is a class identifier, and class identifiers don't carry whitespace semantics. (Other VARCHAR columns — free-text descriptions, names, etc. — are left alone; only the label column gets this treatment.) ## Two changes ### 1. Write-side strip — ``BaseIngestor.process_record`` Strip ``label_val.strip()`` for string labels before ``label_policy.apply`` runs. Numeric labels (INT class IDs) and other non-strings pass through unchanged. ### 2. Preflight WARNING — ``LabelDiversityValidator`` Collapse whitespace-equivalent values BEFORE counting distinct, AND surface a WARNING listing the pre-strip variants when collapsing changed anything. The validator and the write side now use the same distinct count, so a dataset with ``[" A ", "A", "B"]``: - Validator: distinct=2 (A, B), warning lists ``{"A": [" A ", "A"]}`` - Ingest: stores ``A`` and ``B`` — 2 classes in MySQL, as intended A dataset where ALL variants collapse to a single class (``[" A ", "A", " A"]``) still fails diversity: - Validator: distinct=1, rejects with a message that explicitly mentions whitespace stripping so the user knows what the validator did ## Tests ``test_label_diversity_validator.py`` (2 new): - ``test_whitespace_duplicates_collapsed_and_warned`` — mixed dataset passes; warning names the variants - ``test_whitespace_only_single_value_still_fails`` — all-collapse case rejects with a clear message ``test_ingestor_base.py`` (3 new): - ``test_process_record_strips_whitespace_from_string_label`` — ``" A "`` → ``"A"`` - ``test_process_record_label_strip_makes_whitespace_variants_equivalent`` — two records with whitespace variants produce identical cleaned labels - ``test_process_record_label_strip_preserves_non_string_labels`` — INT 42 passes through unchanged ## Surfaced by Adversarial new-user test pass against v0.3.10-rc2 (scenario N15): shipped CSV had ``" A "`` from an Excel paste, MySQL ended up with 3 distinct classes ``{" A ", "A", "B"}`` and the ingestor reported success. ## Related - #251: LabelDiversityValidator caught the inverse case (1 distinct). This fixes the silent-but-more-distinct sibling: validator accepts because the strings are bytewise distinct, but the user has fewer semantic classes than that. - The whitespace strip on column HEADERS in csv_ingestor (already in develop) establishes the convention; this extends it to label values. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
|
👋 Heads-up — Code review queue is at 12 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
This was referenced Jun 15, 2026
aptracebloc
approved these changes
Jun 15, 2026
saadqbal
approved these changes
Jun 15, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #261.
Surfaced by adversarial new-user testing against v0.3.10-rc2 (scenario N15).
The corruption
A new-user CSV — innocuous-looking:
```csv
id,name,label
1,Alice, A
2,Bob,B
3,Carol,A
```
User intends 2 classes (A, B). Whitespace-padded `" A "` row 1 is invisible in a notebook view. Ingestor reports:
```
🎉 Ingestion completed successfully! 3 records inserted.
```
MySQL contents:
```sql
GROUP BY label;SELECT label, LENGTH(label), HEX(label), COUNT() FROM
label LENGTH(label) HEX(label) COUNT()
A 5 2020412020 1 <-- intended as 'A'
A 1 41 1
B 1 42 1
```
LabelDiversityValidator counted 3 distinct ≥ 2 → passed. A model trained on this learns 3 classes instead of 2, and silently misclassifies any clean `A` example at inference because `A` and ` A ` were trained as different things.
What the fix does
Write side — `BaseIngestor.process_record`
Strip `label_val.strip()` for string labels before `label_policy.apply` runs. Mirrors what the framework already does for the `data_id` column (line below) and for column HEADERS (`chunk.columns.str.strip()` in csv_ingestor). A label is a class identifier; class identifiers don't carry whitespace semantics.
Numeric labels (INT class IDs) and other non-strings pass through unchanged. Other VARCHAR columns — descriptions, names, free-text — are left alone; only the label column gets stripped.
Preflight WARNING — `LabelDiversityValidator`
Collapse whitespace-equivalent values BEFORE counting distinct, AND surface a WARNING listing the pre-strip variants when collapsing changed anything. The validator and the write side now use the same distinct count.
A dataset with `[" A ", "A", "B"]`:
A dataset where ALL variants collapse to a single class (`[" A ", "A", " A"]`) still fails diversity, with a message that explicitly mentions whitespace stripping.
Test plan
Related
Minor follow-ups (flagged in #261, not in this PR)
🤖 Generated with Claude Code
Note
Medium Risk
Changes label normalization on every classification ingest and preflight diversity checks; behavior is narrow (label column strings only) but affects training class sets for existing messy CSVs.
Overview
Fixes silent label-set corruption where padded strings like
" A "and"A"were stored as separate MySQL classes while preflight could still report enough distinct labels (#261).Ingestion:
BaseIngestornowstrip()s string label values beforelabel_policy.apply, aligned withdata_idand CSV header trimming; non-string labels are unchanged.Preflight:
LabelDiversityValidatorcollapses whitespace-equivalent distinct values when counting classes, emits a warning listing raw variants when collapse occurs, and mentions stripping in single-class rejection messages so counts match what ingestion will write.Tests: Five new cases cover ingest stripping/equivalence/numeric passthrough and validator collapse, warnings, and post-strip single-class failure.
Reviewed by Cursor Bugbot for commit 2ac0f77. Bugbot is set up for automated code reviews on this repo. Configure here.