fix(ingest): catch single-label classification at preflight + surface backend reason (issue #251) by divyasinghds · Pull Request #252 · tracebloc/data-ingestors

divyasinghds · 2026-06-15T08:14:05Z

Closes #251.

Two coupled improvements to the user experience when a classification dataset has fewer than 2 distinct label values — surfaced by adversarial testing against v0.3.10-rc1.

The cascade we hit

Submit a `tabular_classification` (or any classification-family) dataset where the `label` column has ONE distinct value across all rows (e.g. all rows labelled `"X"`):

1. Validators pass: there is no per-record issue, just label uniformity.
2. All rows insert into MySQL successfully.
3. `prepare_dataset` calls `/global_meta/prepare/`.
4. Backend returns HTTP 400: `{"message":"Please provide atleast 2 labels."}`
5. Framework raises a generic error: "Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above."

The user-visible error tells you to grep the log. The actual reason is one ERROR-line up. Rows are orphaned in MySQL by the time anything surfaces.

Fix 1 — `LabelDiversityValidator` at preflight

New validator: `tracebloc_ingestor/validators/label_diversity_validator.py`.

- Reads only the label column when given a CSV path (`usecols=[...]`), so a multi-GB proteomics panel doesn't get materialised just to count labels.
- Counts non-null distinct values; rejects when the count is `< min_distinct` (default 2, matching the backend's exact contract).
- Error message names the offending distinct value(s) and the value-count breakdown:

```
Classification category requires at least 2 distinct label values in
column 'label'; this dataset has 1 distinct value(s): ['X'].
Value counts: {'X': 10}. If this is intentional (e.g. you have a
continuous target), pick a regression-family category like
tabular_regression or time_series_forecasting instead.
```
- Suggests regression-family categories so users with continuous targets get a pointer instead of being told to fake a second class.
- Gracefully no-ops on empty input or a missing label column; other validators already surface those.
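The check described above can be sketched in a few lines. This is a stdlib-only illustration, not the validator's actual code: the real implementation reads the column through pandas `usecols`, and the function name `check_label_diversity`, its signature, and the treatment of empty cells as null are all assumptions made for the example.

```python
import csv
import io
from collections import Counter


def check_label_diversity(csv_file, label_column="label", min_distinct=2):
    """Return (ok, message) for a CSV file-like object.

    Hypothetical helper; it only mirrors the behaviour described in this PR.
    """
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or label_column not in reader.fieldnames:
        # Empty input or missing column: skip, other validators report those.
        return True, ""

    counts = Counter()
    rows = 0
    for row in reader:
        rows += 1
        value = row[label_column]
        # Assumption: an empty or absent cell counts as null.
        if value not in (None, ""):
            counts[value] += 1

    if rows == 0 or len(counts) >= min_distinct:
        return True, ""

    message = (
        f"Classification category requires at least {min_distinct} distinct "
        f"label values in column '{label_column}'; this dataset has "
        f"{len(counts)} distinct value(s): {sorted(counts)}. "
        f"Value counts: {dict(counts)}."
    )
    return False, message


# Usage: a two-row fixture where every row is labelled "X" is rejected.
ok, message = check_label_diversity(io.StringIO("id,label\n1,X\n2,X\n"))
```

Only the label cell of each row is retained, so memory grows with the number of distinct labels rather than with the width of the table, which is the same motivation the PR gives for `usecols`.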

Wired in via `validators_mapping.py` for every classification-family category:

`tabular_classification`, `image_classification`, `text_classification`, `object_detection`, `semantic_segmentation`, `keypoint_detection`

NOT wired for regression (`tabular_regression`, `time_series_forecasting`, `time_to_event_prediction`: continuous targets) or for `masked_language_modeling` / `token_classification` (self-supervised or per-token labels).
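As a rough picture of the wiring rule, the gate applies to a fixed set of categories and nothing else. The names `CLASSIFICATION_FAMILY` and `wants_label_diversity_check` are hypothetical; the actual structure of `validators_mapping.py` is not shown in this PR description.

```python
# Categories that get the label-diversity gate, as listed in this PR.
CLASSIFICATION_FAMILY = frozenset({
    "tabular_classification",
    "image_classification",
    "text_classification",
    "object_detection",
    "semantic_segmentation",
    "keypoint_detection",
})


def wants_label_diversity_check(category: str) -> bool:
    """True when the category should run the label-diversity validator."""
    return category in CLASSIFICATION_FAMILY
```

An explicit allow-list keeps regression, `masked_language_modeling`, and `token_classification` out of the gate by default, so a newly added category is not blocked by accident.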

Fix 2 — propagate backend body into the user-visible error

`APIClient.prepare_dataset` now stashes the formatted backend response on `self.last_prepare_error` (`HTTP <status>: <body>`). `base.py`'s "Backend failed to prepare" RuntimeError reads it and includes the actual backend reason verbatim.

Before:

Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above.

After:

Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). Backend response: HTTP 400: {"message":"Please provide atleast 2 labels."}

A user piping the log through filtered output (CI, error tooling, paste-into-Slack) sees the cause in the visible line — no more grep.
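The propagation pattern can be sketched as follows. This is a simplified stand-in, not the project's code: the injected `send` callable, the `register` function, and the constructor signature are assumptions for illustration. Only the `last_prepare_error` attribute, the `HTTP <status>: <body>` format, and the message text come from the PR. Resetting the attribute at the top of the call mirrors a follow-up commit on this PR.

```python
class APIClient:
    def __init__(self, send):
        # Hypothetical transport: a callable returning (status, body).
        self._send = send
        self.last_prepare_error = None

    def prepare_dataset(self):
        # Clear first so an early return cannot leave a stale message.
        self.last_prepare_error = None
        try:
            status, body = self._send()
        except OSError as exc:
            # Network failure with no response: keep the stringified exception.
            self.last_prepare_error = str(exc)
            return False
        if status >= 400:
            self.last_prepare_error = f"HTTP {status}: {body}"
            return False
        return True


def register(client):
    """Raise with the backend reason inline when prepare fails."""
    if client.prepare_dataset():
        return
    detail = (
        f"Backend response: {client.last_prepare_error}"
        if client.last_prepare_error
        else "See the logged API error above."
    )
    raise RuntimeError(
        "Backend failed to prepare the dataset; it was NOT registered "
        f"(its rows are already in the database). {detail}"
    )
```

Carrying the reason on the client instance keeps `prepare_dataset`'s boolean return value unchanged for existing callers while still letting the raising layer quote the backend verbatim.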

What this PR does NOT do

The backend typo `"atleast"` → `"at least"` lives in the backend repo, not here. Flagged in #251 ("Single-label classification dataset surfaces as misleading 'Backend failed to prepare' instead of a clear preflight error") for whoever owns that endpoint.

Test plan

- 14 new tests in `tests/test_label_diversity_validator.py`:
  - Positive: 2+ labels, many labels, nulls don't count, case-insensitive column match
  - Failure: single label (with value-counts in message), all-null, one-value-plus-nulls
  - Error message mentions the regression alternative
  - Defensive: empty input / missing column → silent pass (other validators own those)
  - CSV-path: reads only the label column; rejects single-label CSV at the file level
  - Custom label column name
- 3 new tests in `tests/test_api_client_methods.py`:
  - On HTTP error, `last_prepare_error` populated with status + body
  - Clean ingestor: `last_prepare_error is None`
  - Network error (no response): still populated with stringified exception
- CI green
- Re-run S5/S8 from the adversarial pass:
  - Single-label data → fails fast at preflight with the value-counts message (no MySQL writes, no backend round-trip)
  - Multi-label data → continues to pass end-to-end

🤖 Generated with Claude Code

Note

Medium Risk
Changes preflight gating and post-commit registration error messaging on a critical ingest path; mis-tuned label counting could block valid datasets or still allow bad ones, though extensive tests target parity with CSV ingest rules.

Overview
Adds LabelDiversityValidator so classification-family ingests fail before MySQL writes when the label column has fewer than two distinct non-null values, with errors that list offending values/counts and suggest regression categories when appropriate. It is wired through validators_mapping for tabular/image/text/object/semantic/keypoint classification (not token classification or regression/self-supervised), reads CSVs via label-only usecols with NA/dtype rules aligned to ingest via full_schema from base.py, and includes hardening for headers, read failures, and schema vs non-schema NA handling (#252).

APIClient.prepare_dataset now clears and sets last_prepare_error on HTTP/network failures; base.py includes that text in the RuntimeError when prepare fails so users see the backend body (e.g. “Please provide atleast 2 labels.”) instead of only “see the logged API error above.”

Tests cover the validator, mapping, and last_prepare_error behavior.

Reviewed by Cursor Bugbot for commit 4d24151. Bugbot is set up for automated code reviews on this repo.

… backend reason (issue #251)

Two coupled improvements to the user experience when a classification dataset has fewer than 2 distinct label values, surfaced by adversarial testing against v0.3.10-rc1.

## The cascade we hit

Submit a `tabular_classification` (or any classification-family) dataset where the `label` column has only ONE distinct value across all rows (e.g. a 10-row test fixture all labelled `"X"`):

1. Validators pass: no per-record issue, just label uniformity.
2. All rows insert into MySQL successfully.
3. `prepare_dataset` calls `/global_meta/prepare/`.
4. Backend returns HTTP 400: `{"message":"Please provide atleast 2 labels."}`
5. The framework raises a generic `"Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above."`

The user-visible error says "see the logged API error above", but the ACTUAL reason ("Please provide atleast 2 labels") is one ERROR layer up and easy to miss. Rows are orphaned in MySQL by the time anything surfaces.

## Fix 1: preflight validator (the right layer)

New `LabelDiversityValidator` in `tracebloc_ingestor/validators/label_diversity_validator.py`:

- Reads ONLY the label column when given a CSV path (`usecols=[...]`), so a multi-GB proteomics panel doesn't get materialised to count labels.
- Counts non-null distinct values; rejects when < min_distinct (default 2, matching the backend's exact contract).
- Error message NAMES the offending distinct value(s) and the value-count breakdown so the user immediately sees what's wrong with the input ("1 distinct value(s): ['X']. Value counts: {'X': 10}").
- Suggests the regression-family categories as the right home for a single-target dataset, so users with continuous targets get a pointer instead of being told to fake a second class.
- Gracefully no-ops on empty input or a missing label column; other validators already surface those clearly, so don't double-report.

Wired in via `validators_mapping.py` for every classification-family category: tabular_classification, image_classification, text_classification, object_detection, semantic_segmentation, and keypoint_detection. NOT wired for regression (tabular_regression / time_series_forecasting / time_to_event_prediction), where continuous targets legitimately have "1 distinct value" semantics, nor for masked_language_modeling / token_classification (self-supervised or per-token labels).

## Fix 2: propagate backend body into the user-visible error

`APIClient.prepare_dataset`'s error handler now stashes the formatted backend response (`HTTP <status>: <body>`) on `self.last_prepare_error`. `base.py`'s "Backend failed to prepare" RuntimeError reads that and includes the actual backend reason verbatim in the user-visible message.

Before: Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above.

After: Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). Backend response: HTTP 400: {"message":"Please provide atleast 2 labels."}

A user piping the log through filtered output (CI, error tooling, paste-into-Slack) sees the cause in the visible line; no more grep required.

## Tests

`tests/test_label_diversity_validator.py` (14 cases):

- Positive: 2+ labels, many labels, nulls don't count, case-insensitive column match
- Failure: single label (with value-counts in message), all-null, one-value-plus-nulls
- Error mentions the regression alternative for single-target users
- Defensive: empty input / missing column → silent pass (other validators own those errors)
- CSV-path: reads only the label column (no full-table OOM risk), rejects single-label CSV at the file level
- Custom label column name

Extended `tests/test_api_client_methods.py`:

- On HTTP error, `last_prepare_error` populated with status + body
- On a clean ingestor, `last_prepare_error is None`
- On network error (no response), still populated with the stringified exception

(Backend-side typo "atleast" → "at least" lives in the backend repo; flagged in #251 for whoever maintains that endpoint.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

LukasWodka · 2026-06-15T08:15:02Z

👋 Heads-up — Code review queue is at 16 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#106 — test(integration): pin XGBoost/LightGBM/CatBoost end-to-end averaging round-trips · author: @LukasWodka · no reviewer assigned
averaging-service#111 — fix(averaging): correctness round-2 sweep — N5/N7/N8/N9/N10 + load_model except (Sync develop → master for v0.3.0 release #109) · author: @LukasWodka · no reviewer assigned
client#254 — Sync develop → main for v1.7.1 chart release (egress-enforcement preflight helm test, inert) · author: @saadqbal · no reviewer assigned
client#256 — fix(egress-proxy): probe reports DNS failure distinctly, not as a TCP connect [image_validator falsely rejects images when target_size matches exactly #104] · author: @saadqbal · no reviewer assigned
client-runtime#105 — chore(ci): drop unused mysqlclient dep; report test coverage (Phase 0) · author: @LukasWodka · no reviewer assigned
client-runtime#106 — fix(schema): sync vendored ingest.v1.json — restore token_classification · author: @LukasWodka · no reviewer assigned
data-ingestors#248 — refactor(P2): extract ingestion-summary renderer into reporting.ConsoleRenderer · author: @LukasWodka · no reviewer assigned
data-ingestors#249 — fix(database): CHAR(N) must map to a SQLAlchemy type in _get_sqlalchemy_type · author: @divyasinghds · no reviewer assigned
data-ingestors#250 — fix(csv): empty CSV must fail fast with clear input-error message · author: @divyasinghds · no reviewer assigned
design-system#47 — feat(table): Table compound organism (Figma 959:6061 + 986:9822) · author: @waqaskhanroghani · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

…ader parse (bugbot #252)

Two bugbot findings on the label-diversity preflight (#251):

- High: `_load_data` swallowed read errors and returned None, which `validate` treats as empty → valid. A single-label CSV whose read failed could skip the gate and hit the backend rejection. Read errors now propagate to `validate`'s handler and fail the check (fail-closed).
- Medium: header detection used a naive `header_line.split(",")` that diverges from pandas on quoted/alternate-delimiter headers. On a miss it fell back to `nrows=1` and counted distinct labels on one row, rejecting a diverse dataset. Now resolves the column against pandas' own header parse (`nrows=0`); a genuinely-absent column returns None (benign skip).

Also update `test_validators_mapping` for the `LabelDiversityValidator` now in the classification chains (was a stale pre-existing failure on this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
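Both findings are easy to reproduce in miniature. The sketch below uses the stdlib `csv` module as a stand-in for pandas' header parse, and the `validate` function with its `load_labels` callable is a hypothetical simplification of the validator's flow rather than its real signature.

```python
import csv
import io

header_line = '"label, final",id'

# Naive split breaks the quoted column name into two fragments.
naive = header_line.split(",")

# A real CSV parser keeps the quoted name intact.
parsed = next(csv.reader(io.StringIO(header_line)))


def validate(load_labels, min_distinct=2):
    """Fail closed: a read error rejects the dataset instead of skipping."""
    try:
        labels = load_labels()
    except Exception as exc:
        return False, f"could not read label column: {exc}"
    if labels is None:
        # Genuinely absent column: benign skip, other validators report it.
        return True, "label column absent; check skipped"
    return len(set(labels)) >= min_distinct, ""
```

The key distinction is between "the column is not there" (skip) and "the read failed" (reject). Collapsing both into `None` is what let a broken read pass the gate.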

- `prepare_dataset` clears `last_prepare_error` up front so an early `return False` (local mode / invalid category) can't leave a stale message that `base.py` then attaches to an unrelated failure.
- `LabelDiversityValidator` now accepts the schema so the label column is read with the same NA / dtype rules as `CSVIngestor` / `DataValidator`. Without it, the distinct-label count could disagree with what actually ingests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…bot #252)

Tests for the follow-up fixes in 1539301:

- schema label: NA sentinels ("null"/"NA") read as missing → single-label correctly flagged
- non-schema label: "NA"/"null" kept as genuine classes (matches ingestor)
- string-schema label: numeric-looking values ("01","1","1.0") not collapsed by numeric inference

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
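The three cases can be illustrated with a small model of the counting rule. Everything here is an assumption made for the example: the `NA_SENTINELS` set is an illustrative subset, `distinct_labels` is a hypothetical helper, and the source does not say how empty cells are treated when no schema is present, so this sketch drops only `None` in that branch.

```python
# Assumption: illustrative subset of the sentinels the ingestor treats as NA.
NA_SENTINELS = {"", "null", "NA"}


def distinct_labels(raw_values, schema_type=None):
    """Distinct label strings, with NA rules applied only for schema columns."""
    if schema_type is None:
        # Non-schema label: "NA" and "null" stay as genuine classes.
        return {v for v in raw_values if v is not None}
    # Schema label: sentinel strings are read as missing values.
    return {v for v in raw_values if v is not None and v not in NA_SENTINELS}


with_schema = distinct_labels(["X", "null", "NA"], schema_type="VARCHAR(32)")
without_schema = distinct_labels(["X", "null", "NA"])

# Values stay strings, so numeric-looking labels remain three classes.
as_strings = distinct_labels(["01", "1", "1.0"], schema_type="VARCHAR(32)")
# Numeric inference would collapse the same values into one class.
as_numbers = {float(v) for v in ["01", "1", "1.0"]}
```

With a schema, the first dataset has one real class and should be rejected; without one, it has three and should pass. The last two lines show why reading a string-typed label with numeric inference would undercount.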

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit f271082.

…d (bugbot #252)

`base.py` deletes the label/annotation/id columns from `file_options["schema"]` (they're framework columns, not table columns), but `CSVIngestor` reads the file with NA/dtype rules from the FULL `self.schema`, so the label column DOES get NA-sentinel treatment at ingest.

The label-diversity validator was wired with the stripped schema, so `_schema_type_for(label)` always returned None and the NA/dtype rules never applied to the label: "null"/"NA"/"" inflated the distinct count at preflight while ingestion treated them as missing, letting an effectively single-class dataset slip past the gate and fail at backend prepare.

`base.py` now also passes the unstripped schema as `full_schema`; the validator factory prefers it (falling back to `schema` for direct/test callers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
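The fallback rule is small enough to show directly. This is a sketch under assumptions: the schema is modelled as a plain dict of column name to type string, and `schema_for_label_check` is a hypothetical name, not the factory's real one.

```python
def schema_for_label_check(file_options):
    """Prefer the unstripped schema; fall back to the stripped one."""
    return file_options.get("full_schema") or file_options.get("schema") or {}


# Assumed schema shape: column name -> SQL-like type string.
full = {"id": "INT", "feature": "FLOAT", "label": "VARCHAR(32)"}
# The stripped copy has the framework columns removed.
stripped = {k: v for k, v in full.items() if k not in ("id", "label")}

label_type_new = schema_for_label_check(
    {"schema": stripped, "full_schema": full}
).get("label")
label_type_old = schema_for_label_check({"schema": stripped}).get("label")
```

With only the stripped schema the label's type lookup comes back empty, which is the condition that silently disabled the NA rules before this fix.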

…bugbot #252)

`CSVIngestor` strips column-name whitespace on read (`chunk.columns.str.strip()`), so a header like " label " is ingested as `label`. The validator resolved the label column against the raw header, treated " label " as missing, skipped the diversity check with a warning, and a single-class CSV passed preflight only to fail at backend prepare.

`_resolve_column` and `_schema_type_for` now strip (and lowercase) on both sides, matching the ingestor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
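Normalising both sides of the comparison might look like the sketch below. The function name and signature are hypothetical; only the strip-and-lowercase rule comes from the commit message.

```python
def resolve_column(header, wanted):
    """Return the raw header name matching `wanted`, or None if absent.

    Both sides are stripped and lowercased before comparison.
    """
    target = wanted.strip().lower()
    for name in header:
        if name.strip().lower() == target:
            return name
    return None


# Usage: a padded header still resolves to the label column.
padded = resolve_column([" label ", "id"], "label")
```

Returning the raw name rather than the normalised one lets the caller pass it straight back to the reader that produced the header.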

Release bundles the fixes merged to develop since 0.3.9 (#230–#254):

- Ingestion accounting: dropped records fail the run; JSON read-layer fails fast (#230, #234, #235)
- Coercion: single source of truth for NA policy + int64 range (#236, #237)
- Security: block path traversal via manifest filename/mask_id (#239)
- UX papercuts: delimiter hint, NUL truncation, table-name message, Config numeric coercion (#238)
- Schema/DB: real MySQL column types, CHAR(N) mapping, drop instance_segmentation from enum, empty-CSV fast fail (#240, #241, #249, #250)
- DataValidator: accept single-dict / filter non-dict JSON (#232, #233)
- Single-label classification caught at preflight + friendly backend reason, with the full bugbot-hardened LabelDiversityValidator (#251, #252)
- CLI: schema descriptions surfaced in validation errors (#254)
- Reporting: ConsoleRenderer extraction; packaging split (#248, #246)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

divyasinghds self-assigned this Jun 15, 2026

cursor Bot reviewed Jun 15, 2026

Comment thread tracebloc_ingestor/validators/label_diversity_validator.py Outdated

Comment thread tracebloc_ingestor/validators/label_diversity_validator.py Outdated

cursor Bot reviewed Jun 15, 2026

Comment thread tracebloc_ingestor/validators/label_diversity_validator.py Outdated

Comment thread tracebloc_ingestor/api/client.py

cursor Bot reviewed Jun 15, 2026

Comment thread tracebloc_ingestor/utils/validators_mapping.py Outdated

cursor Bot reviewed Jun 15, 2026

Comment thread tracebloc_ingestor/validators/label_diversity_validator.py

LukasWodka mentioned this pull request Jun 15, 2026

refactor(P3a): move per-category behavior flags into a ModalityRegistry #253

Merged

aptracebloc approved these changes Jun 15, 2026

LukasWodka mentioned this pull request Jun 15, 2026

refactor(P3d): source data_format from the ModalityRegistry #257

Merged

divyasinghds merged commit 3f4dbb5 into develop Jun 15, 2026
4 checks passed

divyasinghds deleted the fix/label-diversity-and-clear-error branch June 15, 2026 09:11

divyasinghds mentioned this pull request Jun 15, 2026

Release v0.3.10 (ingestion hardening + path-traversal fix + single-label preflight) #259

Merged

4 tasks

saadqbal mentioned this pull request Jun 15, 2026

Release v0.3.10 (ingestion hardening + path-traversal fix + single-label preflight) #268

Closed

5 tasks

divyasinghds merged 6 commits into develop from fix/label-diversity-and-clear-error

Fix 1 — `LabelDiversityValidator` at preflight