Skip to content

fix(label): strip whitespace from label values to prevent silent class duplication (issue #261)#262

Merged
divyasinghds merged 1 commit into
developfrom
fix/strip-label-column-values
Jun 15, 2026
Merged

fix(label): strip whitespace from label values to prevent silent class duplication (issue #261)#262
divyasinghds merged 1 commit into
developfrom
fix/strip-label-column-values

Conversation

@divyasinghds

@divyasinghds divyasinghds commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Closes #261.

Surfaced by adversarial new-user testing against v0.3.10-rc2 (scenario N15).

The corruption

A new-user CSV — innocuous-looking:

```csv
id,name,label
1,Alice, A
2,Bob,B
3,Carol,A
```

User intends 2 classes (A, B). Whitespace-padded `" A "` row 1 is invisible in a notebook view. Ingestor reports:

```
🎉 Ingestion completed successfully! 3 records inserted.
```

MySQL contents:

```sql
SELECT label, LENGTH(label), HEX(label), COUNT() FROM

GROUP BY label;
label LENGTH(label) HEX(label) COUNT(
)
A 5 2020412020 1 <-- intended as 'A'
A 1 41 1
B 1 42 1
```

LabelDiversityValidator counted 3 distinct ≥ 2 → passed. A model trained on this learns 3 classes instead of 2, and silently misclassifies any clean `A` example at inference because `A` and ` A ` were trained as different things.

What the fix does

Write side — `BaseIngestor.process_record`

Strip `label_val.strip()` for string labels before `label_policy.apply` runs. Mirrors what the framework already does for the `data_id` column (line below) and for column HEADERS (`chunk.columns.str.strip()` in csv_ingestor). A label is a class identifier; class identifiers don't carry whitespace semantics.

Numeric labels (INT class IDs) and other non-strings pass through unchanged. Other VARCHAR columns — descriptions, names, free-text — are left alone; only the label column gets stripped.

Preflight WARNING — `LabelDiversityValidator`

Collapse whitespace-equivalent values BEFORE counting distinct, AND surface a WARNING listing the pre-strip variants when collapsing changed anything. The validator and the write side now use the same distinct count.

A dataset with `[" A ", "A", "B"]`:

  • Validator: distinct=2 (A, B), warning lists `{"A": [" A ", "A"]}`
  • Ingest: stores 2 classes in MySQL, as intended

A dataset where ALL variants collapse to a single class (`[" A ", "A", " A"]`) still fails diversity, with a message that explicitly mentions whitespace stripping.

Test plan

  • 2 new tests in `test_label_diversity_validator.py`:
    • mixed dataset passes; warning names the offending pre-strip variants
    • all-collapse case still rejects with clear message about stripping
  • 3 new tests in `test_ingestor_base.py`:
    • `" A "` → cleaned `"A"`
    • Two whitespace variants produce identical cleaned labels
    • INT label (numeric class ID) passes through unchanged
  • Existing label tests (Single-label classification dataset surfaces as misleading 'Backend failed to prepare' instead of a clear preflight error #251 cases, bucket policy, real-value preservation) continue to pass
  • CI green
  • Re-run N15: MySQL has 2 distinct classes, validator emits warning naming the whitespace variants

Related

Minor follow-ups (flagged in #261, not in this PR)

  • N12 (JSONL): opaque "Trailing data" error — separate UX fix
  • N3 (BIGINTEGER typo): no "did you mean BIGINT?" hint — separate small PR

🤖 Generated with Claude Code


Note

Medium Risk
Changes label normalization on every classification ingest and preflight diversity checks; behavior is narrow (label column strings only) but affects training class sets for existing messy CSVs.

Overview
Fixes silent label-set corruption where padded strings like " A " and "A" were stored as separate MySQL classes while preflight could still report enough distinct labels (#261).

Ingestion: BaseIngestor now strip()s string label values before label_policy.apply, aligned with data_id and CSV header trimming; non-string labels are unchanged.

Preflight: LabelDiversityValidator collapses whitespace-equivalent distinct values when counting classes, emits a warning listing raw variants when collapse occurs, and mentions stripping in single-class rejection messages so counts match what ingestion will write.

Tests: Five new cases cover ingest stripping/equivalence/numeric passthrough and validator collapse, warnings, and post-strip single-class failure.

Reviewed by Cursor Bugbot for commit 2ac0f77. Bugbot is set up for automated code reviews on this repo. Configure here.

…s duplication (issue #261)

A new-user CSV with ``"  A  "`` (whitespace-padded) mixed with ``"A"``
landed in MySQL as TWO distinct classes — the ingestor reported
``🎉 Ingestion completed successfully!``, the LabelDiversityValidator
happily passed (2 distinct values ≥ 2), and a model trained on the
output learned 3 classes instead of the 2 the user intended. Silent
label-set corruption.

The framework already strips column HEADERS (``chunk.columns.str.strip()``
in csv_ingestor) and the ``data_id`` column (line above in
process_record). Extending the same convention to label-column VALUES is
internally consistent — a label is a class identifier, and class
identifiers don't carry whitespace semantics. (Other VARCHAR columns —
free-text descriptions, names, etc. — are left alone; only the label
column gets this treatment.)

## Two changes

### 1. Write-side strip — ``BaseIngestor.process_record``

Strip ``label_val.strip()`` for string labels before ``label_policy.apply``
runs. Numeric labels (INT class IDs) and other non-strings pass through
unchanged.

### 2. Preflight WARNING — ``LabelDiversityValidator``

Collapse whitespace-equivalent values BEFORE counting distinct, AND
surface a WARNING listing the pre-strip variants when collapsing
changed anything. The validator and the write side now use the same
distinct count, so a dataset with ``["  A  ", "A", "B"]``:
  - Validator: distinct=2 (A, B), warning lists ``{"A": ["  A  ", "A"]}``
  - Ingest: stores ``A`` and ``B`` — 2 classes in MySQL, as intended

A dataset where ALL variants collapse to a single class
(``["  A  ", "A", "  A"]``) still fails diversity:
  - Validator: distinct=1, rejects with a message that explicitly
    mentions whitespace stripping so the user knows what the validator
    did

## Tests

``test_label_diversity_validator.py`` (2 new):
  - ``test_whitespace_duplicates_collapsed_and_warned`` — mixed
    dataset passes; warning names the variants
  - ``test_whitespace_only_single_value_still_fails`` — all-collapse
    case rejects with a clear message

``test_ingestor_base.py`` (3 new):
  - ``test_process_record_strips_whitespace_from_string_label`` —
    ``"  A  "`` → ``"A"``
  - ``test_process_record_label_strip_makes_whitespace_variants_equivalent``
    — two records with whitespace variants produce identical cleaned
    labels
  - ``test_process_record_label_strip_preserves_non_string_labels`` —
    INT 42 passes through unchanged

## Surfaced by

Adversarial new-user test pass against v0.3.10-rc2 (scenario N15):
shipped CSV had ``"  A  "`` from an Excel paste, MySQL ended up with 3
distinct classes ``{"  A  ", "A", "B"}`` and the ingestor reported
success.

## Related

- #251: LabelDiversityValidator caught the inverse case (1 distinct).
  This fixes the silent-but-more-distinct sibling: validator accepts
  because the strings are bytewise distinct, but the user has fewer
  semantic classes than that.
- The whitespace strip on column HEADERS in csv_ingestor (already in
  develop) establishes the convention; this extends it to label values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@divyasinghds divyasinghds self-assigned this Jun 15, 2026
@LukasWodka

Copy link
Copy Markdown
Collaborator

👋 Heads-up — Code review queue is at 12 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@divyasinghds divyasinghds merged commit cab8514 into develop Jun 15, 2026
7 checks passed
@divyasinghds divyasinghds deleted the fix/strip-label-column-values branch June 15, 2026 11:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants