fix(ingest): catch single-label classification at preflight + surface backend reason (issue #251) #252
Merged
… backend reason (issue #251)

Two coupled improvements to the user experience when a classification dataset has fewer than 2 distinct label values, surfaced by adversarial testing against v0.3.10-rc1.

## The cascade we hit

Submit a `tabular_classification` (or any classification-family) dataset where the `label` column has only ONE distinct value across all rows (e.g. a 10-row test fixture all labelled `"X"`):

1. Validators pass: no per-record issue, just label uniformity.
2. All rows insert into MySQL successfully.
3. `prepare_dataset` calls `/global_meta/prepare/`.
4. Backend returns HTTP 400: `{"message":"Please provide atleast 2 labels."}`
5. The framework raises a generic `"Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above."`

The user-visible error says "see the logged API error above", but the ACTUAL reason ("Please provide atleast 2 labels") is one ERROR layer up and easy to miss. Rows are orphaned in MySQL by the time anything surfaces.

## Fix 1: preflight validator (the right layer)

New `LabelDiversityValidator` in `tracebloc_ingestor/validators/label_diversity_validator.py`:

- Reads ONLY the label column when given a CSV path (usecols=[...]; a multi-GB proteomics panel doesn't get materialised to count labels).
- Counts non-null distinct values; rejects when < min_distinct (default 2, matching the backend's exact contract).
- Error message NAMES the offending distinct value(s) and the value-count breakdown so the user immediately sees what's wrong with the input ("1 distinct value(s): ['X']. Value counts: {'X': 10}").
- Suggests the regression-family categories as the right home for a single-target dataset, so users with continuous targets get a pointer instead of being told to fake a second class.
- Gracefully no-ops on empty input or a missing label column. Other validators already surface those clearly; don't double-report.

Wired in via `validators_mapping.py` for every classification-family category: tabular_classification, image_classification, text_classification, object_detection, semantic_segmentation, and keypoint_detection. NOT wired for regression (tabular_regression / time_series_forecasting / time_to_event_prediction), where continuous targets legitimately have "1 distinct value" semantics, nor for masked_language_modeling / token_classification (self-supervised or per-token labels).

## Fix 2: propagate backend body into the user-visible error

`APIClient.prepare_dataset`'s error handler now stashes the formatted backend response (`HTTP <status>: <body>`) on `self.last_prepare_error`. `base.py`'s "Backend failed to prepare" RuntimeError reads that and includes the actual backend reason verbatim in the user-visible message.

Before:

    Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above.

After:

    Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). Backend response: HTTP 400: {"message":"Please provide atleast 2 labels."}

A user piping the log through filtered output (CI, error tooling, paste-into-Slack) sees the cause in the visible line; no more grep required.

## Tests

`tests/test_label_diversity_validator.py` (14 cases):

- Positive: 2+ labels, many labels, nulls don't count, case-insensitive column match
- Failure: single label (with value-counts in message), all-null, one-value-plus-nulls
- Error mentions the regression alternative for single-target users
- Defensive: empty input / missing column → silent pass (other validators own those errors)
- CSV-path: reads only the label column (no full-table OOM risk), rejects single-label CSV at the file level
- Custom label column name

Extended `tests/test_api_client_methods.py`:

- On HTTP error, `last_prepare_error` populated with status + body
- On a clean ingestor, `last_prepare_error is None`
- On network error (no response), still populated with the stringified exception

(Backend-side typo "atleast" → "at least" lives in the backend repo, flagged in #251 for whoever maintains that endpoint.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
👋 Heads-up: the Code review queue is at 16 / 8, above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
…ader parse (bugbot #252)

Two bugbot findings on the label-diversity preflight (#251):

- High: `_load_data` swallowed read errors and returned None, which `validate` treats as empty → valid. A single-label CSV whose read failed could skip the gate and hit the backend rejection. Read errors now propagate to `validate`'s handler and fail the check (fail-closed).
- Medium: header detection used a naive `header_line.split(",")` that diverges from pandas on quoted/alternate-delimiter headers. On a miss it fell back to nrows=1 and counted distinct labels on one row, rejecting a diverse dataset. Now resolves the column against pandas' own header parse (nrows=0); a genuinely-absent column returns None (benign skip).

Also update test_validators_mapping for the LabelDiversityValidator now in the classification chains (was a stale pre-existing failure on this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
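The fail-closed behaviour described in the High finding can be sketched as below. This is a simplified illustration, not the repository's code: `validate` here is a hypothetical stand-in that takes a `read` callable (also hypothetical) in place of the real `_load_data`, and the error wording is invented for the example.

```python
def validate(read, min_distinct=2):
    """Return (is_valid, message). `read` is a hypothetical loader callable.

    `read` returns a list of label values, or None when the label column
    is genuinely absent. Any exception it raises fails the check.
    """
    try:
        labels = read()
    except Exception as exc:
        # Fail closed: a read error must not be mistaken for "empty, so valid".
        return False, f"Could not read label column: {exc}"

    if not labels:
        # None (missing column) or empty input: benign skip.
        return True, ""

    distinct = {value for value in labels if value is not None}
    if len(distinct) < min_distinct:
        return False, f"{len(distinct)} distinct value(s): {sorted(distinct)}"
    return True, ""
```

The key point is that only a genuinely absent column or empty input takes the silent-pass path; an unreadable file is reported as a validation failure.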
- prepare_dataset clears `last_prepare_error` up front so an early `return False` (local mode / invalid category) can't leave a stale message that base.py then attaches to an unrelated failure.
- LabelDiversityValidator now accepts the schema so the label column is read with the same NA / dtype rules as CSVIngestor / DataValidator. Without it, the distinct-label count could disagree with what actually ingests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bot #252)

Tests for the follow-up fixes in 1539301:

- schema label: NA sentinels ("null"/"NA") read as missing → single-label correctly flagged
- non-schema label: "NA"/"null" kept as genuine classes (matches ingestor)
- string-schema label: numeric-looking values ("01","1","1.0") not collapsed by numeric inference

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
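The three test scenarios above can be illustrated with a small sketch. It assumes, for illustration only, that the NA sentinel set is `{"null", "NA"}` and that labels are compared as raw strings; the actual sentinel list and dtype rules live in the ingestor and are not shown in this PR. `distinct_labels` is a hypothetical helper.

```python
# Assumed sentinel set, taken from the examples in the commit message.
NA_SENTINELS = {"null", "NA"}


def distinct_labels(raw_values, label_in_schema):
    """Return the set of distinct label classes (hypothetical helper).

    When the label is declared in the schema, NA sentinels are treated as
    missing. Otherwise they are kept as genuine classes. Values stay
    strings, so "01", "1" and "1.0" are never merged.
    """
    if label_in_schema:
        return {value for value in raw_values if value not in NA_SENTINELS}
    return set(raw_values)
```

With a schema-declared label, `["X", "null", "NA"]` collapses to a single class and would be rejected; without one, the same input counts as three classes.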
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit f271082.
…d (bugbot #252)

base.py deletes the label/annotation/id columns from file_options["schema"] (they're framework columns, not table columns), but CSVIngestor reads the file with NA/dtype rules from the FULL self.schema, so the label column DOES get NA-sentinel treatment at ingest.

The label-diversity validator was wired with the stripped schema, so _schema_type_for(label) always returned None and the NA/dtype rules never applied to the label: "null"/"NA"/"" inflated the distinct count at preflight while ingestion treated them as missing, letting an effectively single-class dataset slip past the gate and fail at backend prepare.

base.py now also passes the unstripped schema as `full_schema`; the validator factory prefers it (falling back to `schema` for direct/test callers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bugbot #252)

CSVIngestor strips column-name whitespace on read (chunk.columns.str.strip()), so a header like " label " is ingested as `label`. The validator resolved the label column against the raw header, treated " label " as missing, skipped the diversity check with a warning, and a single-class CSV passed preflight only to fail at backend prepare.

_resolve_column and _schema_type_for now strip (and lowercase) on both sides, matching the ingestor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
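A minimal sketch of the normalised lookup described above. `resolve_column` is a hypothetical simplification of `_resolve_column`, whose real signature is not shown in this PR; it assumes the header is already available as a list of raw column names.

```python
def resolve_column(header, wanted):
    """Find `wanted` in `header`, ignoring surrounding whitespace and case.

    Returns the raw header name (so the caller can still select that
    column from the file), or None when the column is genuinely absent.
    """
    target = wanted.strip().lower()
    for name in header:
        if name.strip().lower() == target:
            return name
    return None
```

Returning the raw name rather than the normalised one keeps the lookup usable against the unmodified file, while the comparison itself mirrors the stripping the ingestor performs.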
This was referenced Jun 15, 2026
Merged
aptracebloc approved these changes Jun 15, 2026
4 tasks
divyasinghds added a commit that referenced this pull request Jun 15, 2026
Release bundles the fixes merged to develop since 0.3.9 (#230–#254):

- Ingestion accounting: dropped records fail the run; JSON read-layer fails fast (#230, #234, #235)
- Coercion: single source of truth for NA policy + int64 range (#236, #237)
- Security: block path traversal via manifest filename/mask_id (#239)
- UX papercuts: delimiter hint, NUL truncation, table-name message, Config numeric coercion (#238)
- Schema/DB: real MySQL column types, CHAR(N) mapping, drop instance_segmentation from enum, empty-CSV fast fail (#240, #241, #249, #250)
- DataValidator: accept single-dict / filter non-dict JSON (#232, #233)
- Single-label classification caught at preflight + friendly backend reason, with the full bugbot-hardened LabelDiversityValidator (#251, #252)
- CLI: schema descriptions surfaced in validation errors (#254)
- Reporting: ConsoleRenderer extraction; packaging split (#248, #246)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks

Closes #251.
Two coupled improvements to the user experience when a classification dataset has fewer than 2 distinct label values — surfaced by adversarial testing against v0.3.10-rc1.
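## The cascade we hit
Submit a `tabular_classification` (or any classification-family) dataset where the `label` column has ONE distinct value across all rows (e.g. all rows labelled `"X"`):
1. Validators pass: no per-record issue, just label uniformity.
2. All rows insert into MySQL successfully.
3. `prepare_dataset` calls `/global_meta/prepare/`.
4. Backend returns HTTP 400: `{"message":"Please provide atleast 2 labels."}`
5. The framework raises a generic `"Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above."`

The user-visible error tells you to grep the log. The actual reason is one ERROR line up. Rows are orphaned in MySQL by the time anything surfaces.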
## Fix 1: `LabelDiversityValidator` at preflight
New validator: `tracebloc_ingestor/validators/label_diversity_validator.py`.
- Reads only the label column when given a CSV path (`usecols=[...]`; a multi-GB proteomics panel doesn't get materialised just to count labels).
- Counts non-null distinct values; rejects when `< min_distinct` (default 2, matching the backend's exact contract).
- Error message names the offending distinct value(s) and value-count breakdown:
```
Classification category requires at least 2 distinct label values in
column 'label'; this dataset has 1 distinct value(s): ['X'].
Value counts: {'X': 10}. If this is intentional (e.g. you have a
continuous target), pick a regression-family category like
tabular_regression or time_series_forecasting instead.
```
- Suggests regression-family categories so users with continuous targets get a pointer instead of being told to fake a second class.
- Gracefully no-ops on empty input / missing label column; other validators already surface those.
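The behaviour listed above can be sketched with the standard library alone. This is an illustration, not the validator's implementation: the PR reads the column with pandas' `usecols`, whereas this stand-in streams rows with `csv.DictReader` and keeps only label counts in memory. `check_label_diversity` is a hypothetical name, and treating the empty string as null is an assumption.

```python
import csv
import io
from collections import Counter


def check_label_diversity(csv_file, label_column="label", min_distinct=2):
    """Return (is_valid, message) for a CSV file-like object (hypothetical)."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or label_column not in reader.fieldnames:
        # Empty input or missing column: other validators report these.
        return True, ""

    counts = Counter()
    rows_seen = 0
    for row in reader:
        rows_seen += 1
        value = row[label_column]
        if value is None or value == "":
            continue  # assumed: empty cells are nulls and do not count
        counts[value] += 1

    if rows_seen == 0 or len(counts) >= min_distinct:
        return True, ""

    return False, (
        f"Classification category requires at least {min_distinct} distinct "
        f"label values in column '{label_column}'; this dataset has "
        f"{len(counts)} distinct value(s): {sorted(counts)}. "
        f"Value counts: {dict(counts)}."
    )


# Usage: a two-row fixture where every row carries the same label.
ok, message = check_label_diversity(io.StringIO("id,label\n1,X\n2,X\n"))
```

Note that an all-null label column is rejected (rows exist but no class was counted), while a file with no data rows passes silently, matching the split described in the test list.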
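Wired in via `validators_mapping.py` for every classification-family category: `tabular_classification`, `image_classification`, `text_classification`, `object_detection`, `semantic_segmentation`, and `keypoint_detection`.
NOT wired for regression (`tabular_regression`, `time_series_forecasting`, `time_to_event_prediction`), where continuous targets legitimately have "1 distinct value" semantics, nor for `masked_language_modeling` / `token_classification` (self-supervised or per-token labels).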
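## Fix 2: propagate backend body into the user-visible error
`APIClient.prepare_dataset` now stashes the formatted backend response on `self.last_prepare_error` (`HTTP <status>: <body>`). `base.py`'s "Backend failed to prepare" RuntimeError reads it and includes the actual backend reason verbatim.
Before: `Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above.`
After: `Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). Backend response: HTTP 400: {"message":"Please provide atleast 2 labels."}`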
A user piping the log through filtered output (CI, error tooling, paste-into-Slack) sees the cause in the visible line — no more grep.
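The error plumbing can be sketched as follows. `FakeAPIClient` and `register` are hypothetical stand-ins for `APIClient` and the `base.py` call site; no HTTP request is made, and the status and body are injected so the message assembly is visible in isolation.

```python
class FakeAPIClient:
    """Hypothetical stand-in for APIClient; only error handling is modelled."""

    def __init__(self, status, body):
        self._status = status
        self._body = body
        self.last_prepare_error = None

    def prepare_dataset(self):
        # Clear first so an earlier failure cannot leak into this call.
        self.last_prepare_error = None
        if self._status >= 400:
            self.last_prepare_error = f"HTTP {self._status}: {self._body}"
            return False
        return True


def register(client):
    """Raise a RuntimeError that carries the backend reason, if any."""
    if client.prepare_dataset():
        return
    message = (
        "Backend failed to prepare the dataset; it was NOT registered "
        "(its rows are already in the database)."
    )
    if client.last_prepare_error:
        message += f" Backend response: {client.last_prepare_error}"
    raise RuntimeError(message)
```

Keeping the reason on the client object lets the caller build one self-contained message, so the cause survives log filtering that keeps only the final error line.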
## What this PR does NOT do
## Test plan
🤖 Generated with Claude Code
Note
Medium Risk
Changes preflight gating and post-commit registration error messaging on a critical ingest path; mis-tuned label counting could block valid datasets or still allow bad ones, though extensive tests target parity with CSV ingest rules.
Overview
Adds `LabelDiversityValidator` so classification-family ingests fail before MySQL writes when the label column has fewer than two distinct non-null values, with errors that list offending values/counts and suggest regression categories when appropriate. It is wired through `validators_mapping` for tabular/image/text/object/semantic/keypoint classification (not token classification or regression/self-supervised), reads CSVs via label-only `usecols` with NA/dtype rules aligned to ingest via `full_schema` from `base.py`, and includes hardening for headers, read failures, and schema vs non-schema NA handling (#252).

`APIClient.prepare_dataset` now clears and sets `last_prepare_error` on HTTP/network failures; `base.py` includes that text in the RuntimeError when prepare fails so users see the backend body (e.g. “Please provide atleast 2 labels.”) instead of only “see the logged API error above.”

Tests cover the validator, mapping, and `last_prepare_error` behavior.

Reviewed by Cursor Bugbot for commit 4d24151. Bugbot is set up for automated code reviews on this repo.