
fix(ingest): catch single-label classification at preflight + surface backend reason (issue #251) #252

Merged
divyasinghds merged 6 commits into `develop` from `fix/label-diversity-and-clear-error`
Jun 15, 2026
@divyasinghds divyasinghds commented Jun 15, 2026

Contributor

Closes #251.

Two coupled improvements to the user experience when a classification dataset has fewer than 2 distinct label values — surfaced by adversarial testing against v0.3.10-rc1.

The cascade we hit

Submit a `tabular_classification` (or any classification-family) dataset where the `label` column has ONE distinct value across all rows (e.g. all rows labelled `"X"`):

  1. Validators pass — no per-record issue, just label uniformity
  2. All rows insert into MySQL successfully
  3. `prepare_dataset` calls `/global_meta/prepare/`
  4. Backend returns HTTP 400: `{"message":"Please provide atleast 2 labels."}`
  5. Framework raises generic: "Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above."

The user-visible error tells you to grep the log. The actual reason is one ERROR-line up. Rows are orphaned in MySQL by the time anything surfaces.

Fix 1 — LabelDiversityValidator at preflight

New validator: `tracebloc_ingestor/validators/label_diversity_validator.py`.

  • Reads only the label column when given a CSV path (`usecols=[...]` — a multi-GB proteomics panel doesn't get materialised just to count labels).

  • Counts non-null distinct values; rejects when `< min_distinct` (default 2 — matches the backend's exact contract).

  • Error message names the offending distinct value(s) and value-count breakdown:

    ```
    Classification category requires at least 2 distinct label values in
    column 'label'; this dataset has 1 distinct value(s): ['X'].
    Value counts: {'X': 10}. If this is intentional (e.g. you have a
    continuous target), pick a regression-family category like
    tabular_regression or time_series_forecasting instead.
    ```

  • Suggests regression-family categories so users with continuous targets get a pointer instead of being told to fake a second class.

  • Gracefully no-ops on empty input / missing label column — other validators already surface those.
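
The counting rule above can be sketched in a few lines. This is a minimal illustration, not the validator's actual code: `check_label_diversity` is a hypothetical name, it takes labels that are already loaded (the label-only CSV read is left out), and the message wording is copied from the example above without the regression hint.

```python
from collections import Counter
from typing import Iterable, Optional


def check_label_diversity(
    labels: Iterable[Optional[str]],
    column: str = "label",
    min_distinct: int = 2,
) -> Optional[str]:
    """Return an error message, or None when the check passes or is skipped."""
    rows = 0
    counts: Counter = Counter()
    for value in labels:
        rows += 1
        if value is not None:  # nulls never count as a class
            counts[value] += 1

    if rows == 0:
        return None  # empty input: left to the validators that own that error
    if len(counts) >= min_distinct:
        return None

    return (
        f"Classification category requires at least {min_distinct} distinct "
        f"label values in column '{column}'; this dataset has {len(counts)} "
        f"distinct value(s): {sorted(counts)}. Value counts: {dict(counts)}."
    )
```

Note that an all-null column is a failure (zero distinct values), while a genuinely empty input is a skip, which matches the test plan below.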

Wired in via `validators_mapping.py` for every classification-family category:

  • `tabular_classification`, `image_classification`, `text_classification`, `object_detection`, `semantic_segmentation`, `keypoint_detection`

NOT wired for regression (`tabular_regression`, `time_series_forecasting`, `time_to_event_prediction`; continuous targets) or for `masked_language_modeling` / `token_classification` (self-supervised or per-token labels).
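
As a rough picture of the wiring, the gate reduces to a membership test on the category name. The constant and function names here are hypothetical; the real `validators_mapping.py` builds full validator chains and is not reproduced in this PR description.

```python
# Hypothetical names; only the category strings come from the PR text.
LABEL_DIVERSITY_CATEGORIES = frozenset({
    "tabular_classification",
    "image_classification",
    "text_classification",
    "object_detection",
    "semantic_segmentation",
    "keypoint_detection",
})


def needs_label_diversity_check(category: str) -> bool:
    """True when the category's validator chain should include the gate."""
    return category in LABEL_DIVERSITY_CATEGORIES
```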

Fix 2 — propagate backend body into the user-visible error

`APIClient.prepare_dataset` now stashes the formatted backend response (`HTTP <status>: <body>`) on `self.last_prepare_error`. `base.py`'s "Backend failed to prepare" RuntimeError reads it and includes the actual backend reason verbatim.

Before:

Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above.

After:

Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). Backend response: HTTP 400: {"message":"Please provide atleast 2 labels."}

A user piping the log through filtered output (CI, error tooling, paste-into-Slack) sees the cause in the visible line — no more grep.
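
A minimal sketch of the hand-off, assuming a stand-in client: `SketchAPIClient`, its injected `send` callable, and `register` are hypothetical and replace the real HTTP call and the `base.py` call site. Only `last_prepare_error`, the `HTTP <status>: <body>` format, and the message text come from this PR.

```python
from typing import Callable, Optional, Tuple


class SketchAPIClient:
    """Hypothetical stand-in for APIClient; `send` replaces the real request."""

    def __init__(self, send: Callable[[], Tuple[int, str]]):
        self._send = send
        self.last_prepare_error: Optional[str] = None

    def prepare_dataset(self) -> bool:
        self.last_prepare_error = None  # clear first so no stale message leaks
        try:
            status, body = self._send()
        except OSError as exc:  # network failure: there is no response body
            self.last_prepare_error = str(exc)
            return False
        if status >= 400:
            self.last_prepare_error = f"HTTP {status}: {body}"
            return False
        return True


def register(client: SketchAPIClient) -> None:
    """Stand-in for the base.py call site that raises the user-visible error."""
    if client.prepare_dataset():
        return
    message = (
        "Backend failed to prepare the dataset; it was NOT registered "
        "(its rows are already in the database)."
    )
    if client.last_prepare_error:
        message += f" Backend response: {client.last_prepare_error}"
    else:
        message += " See the logged API error above."
    raise RuntimeError(message)
```

Keeping the reason on the client rather than changing the return type of `prepare_dataset` lets existing callers that only check the boolean keep working.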

What this PR does NOT do

Test plan

  • 14 new tests in `tests/test_label_diversity_validator.py`:
    • Positive: 2+ labels, many labels, nulls don't count, case-insensitive column match
    • Failure: single label (with value-counts in message), all-null, one-value-plus-nulls
    • Error message mentions the regression alternative
    • Defensive: empty input / missing column → silent pass (other validators own those)
    • CSV-path: reads only the label column; rejects single-label CSV at the file level
    • Custom label column name
  • 3 new tests in `tests/test_api_client_methods.py`:
    • On HTTP error, `last_prepare_error` populated with status + body
    • Clean ingestor: `last_prepare_error is None`
    • Network error (no response): still populated with stringified exception
  • CI green
  • Re-run S5/S8 from the adversarial pass:
    • Single-label data → fails fast at preflight with the value-counts message (no MySQL writes, no backend round-trip)
    • Multi-label data → continues to pass end-to-end

🤖 Generated with Claude Code


Note

Medium Risk
Changes preflight gating and post-commit registration error messaging on a critical ingest path; mis-tuned label counting could block valid datasets or still allow bad ones, though extensive tests target parity with CSV ingest rules.

Overview
Adds LabelDiversityValidator so classification-family ingests fail before MySQL writes when the label column has fewer than two distinct non-null values, with errors that list offending values/counts and suggest regression categories when appropriate. It is wired through validators_mapping for tabular_classification, image_classification, text_classification, object_detection, semantic_segmentation, and keypoint_detection (not token_classification, masked_language_modeling, or the regression categories), reads CSVs via label-only usecols with NA/dtype rules aligned to ingest via full_schema from base.py, and includes hardening for headers, read failures, and schema vs non-schema NA handling (#252).

APIClient.prepare_dataset now clears and sets last_prepare_error on HTTP/network failures; base.py includes that text in the RuntimeError when prepare fails so users see the backend body (e.g. “Please provide atleast 2 labels.”) instead of only “see the logged API error above.”

Tests cover the validator, mapping, and last_prepare_error behavior.

Reviewed by Cursor Bugbot for commit 4d24151. Bugbot is set up for automated code reviews on this repo.

… backend reason (issue #251)

Two coupled improvements to the user experience when a classification
dataset has fewer than 2 distinct label values — surfaced by adversarial
testing against v0.3.10-rc1.

## The cascade we hit

Submit a `tabular_classification` (or any classification-family) dataset
where the `label` column has only ONE distinct value across all rows
(e.g. a 10-row test fixture all labelled `"X"`):

  1. Validators pass — no per-record issue, just label uniformity.
  2. All rows insert into MySQL successfully.
  3. `prepare_dataset` calls `/global_meta/prepare/`.
  4. Backend returns HTTP 400: `{"message":"Please provide atleast 2 labels."}`
  5. The framework raises a generic `"Backend failed to prepare the
     dataset; it was NOT registered (its rows are already in the
     database). See the logged API error above."`

The user-visible error says "see the logged API error above" — but the
ACTUAL reason ("Please provide atleast 2 labels") is one ERROR layer up
and easy to miss. Rows are orphaned in MySQL by the time anything
surfaces.

## Fix 1 — preflight validator (the right layer)

New `LabelDiversityValidator` in
`tracebloc_ingestor/validators/label_diversity_validator.py`:

  - Reads ONLY the label column when given a CSV path (usecols=[...] —
    a multi-GB proteomics panel doesn't get materialised to count
    labels).
  - Counts non-null distinct values; rejects when < min_distinct
    (default 2 — matching the backend's exact contract).
  - Error message NAMES the offending distinct value(s) and the
    value-count breakdown so the user immediately sees what's wrong
    with the input ("1 distinct value(s): ['X']. Value counts:
    {'X': 10}").
  - Suggests the regression-family categories as the right home for a
    single-target dataset, so users with continuous targets get a
    pointer instead of being told to fake a second class.
  - Gracefully no-ops on empty input or a missing label column — other
    validators already surface those clearly; don't double-report.

Wired in via `validators_mapping.py` for every classification-family
category: tabular_classification, image_classification,
text_classification, object_detection, semantic_segmentation, and
keypoint_detection. NOT wired for regression
(tabular_regression / time_series_forecasting /
time_to_event_prediction) — continuous targets legitimately have
"1 distinct value" semantics — nor for masked_language_modeling /
token_classification (self-supervised or per-token labels).

## Fix 2 — propagate backend body into the user-visible error

`APIClient.prepare_dataset`'s error handler now stashes the formatted
backend response (`HTTP <status>: <body>`) on
`self.last_prepare_error`. `base.py`'s "Backend failed to prepare"
RuntimeError reads that and includes the actual backend reason verbatim
in the user-visible message.

Before:

    Backend failed to prepare the dataset; it was NOT registered
    (its rows are already in the database). See the logged API error
    above.

After:

    Backend failed to prepare the dataset; it was NOT registered
    (its rows are already in the database). Backend response:
    HTTP 400: {"message":"Please provide atleast 2 labels."}

A user piping the log through filtered output (CI, error tooling, paste-
into-Slack) sees the cause in the visible line — no more grep
required.

## Tests

`tests/test_label_diversity_validator.py` (14 cases):
  - Positive: 2+ labels, many labels, nulls don't count, case-insensitive
    column match
  - Failure: single label (with value-counts in message), all-null,
    one-value-plus-nulls
  - Error mentions the regression alternative for single-target users
  - Defensive: empty input / missing column → silent pass (other
    validators own those errors)
  - CSV-path: reads only the label column (no full-table OOM risk),
    rejects single-label CSV at the file level
  - Custom label column name

Extended `tests/test_api_client_methods.py`:
  - On HTTP error, `last_prepare_error` populated with status + body
  - On a clean ingestor, `last_prepare_error is None`
  - On network error (no response), still populated with the
    stringified exception

(Backend-side typo "atleast" → "at least" lives in the backend repo —
flagged in #251 for whoever maintains that endpoint.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@divyasinghds divyasinghds self-assigned this Jun 15, 2026
@LukasWodka

Collaborator

👋 Heads-up — Code review queue is at 16 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Comment thread tracebloc_ingestor/validators/label_diversity_validator.py Outdated
Comment thread tracebloc_ingestor/validators/label_diversity_validator.py Outdated
…ader parse (bugbot #252)

Two bugbot findings on the label-diversity preflight (#251):

- High: _load_data swallowed read errors and returned None, which validate
  treats as empty → valid. A single-label CSV whose read failed could skip
  the gate and hit the backend rejection. Read errors now propagate to
  validate's handler and fail the check (fail-closed).

- Medium: header detection used a naive header_line.split(",") that
  diverges from pandas on quoted/alternate-delimiter headers. On a miss it
  fell back to nrows=1 and counted distinct labels on one row, rejecting a
  diverse dataset. Now resolves the column against pandas' own header parse
  (nrows=0); a genuinely-absent column returns None (benign skip).

Also update test_validators_mapping for the LabelDiversityValidator now in
the classification chains (was a stale pre-existing failure on this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
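
Both findings in the commit above can be shown with the standard library. This is a sketch under assumptions: the PR resolves the header with pandas (`nrows=0`), whereas this uses the `csv` module as a dependency-free stand-in, and `parse_header` / `validate` are hypothetical names.

```python
import csv
import io
from typing import Callable, List, Optional


def parse_header(first_line: str) -> List[str]:
    """Parse a header with a real CSV parser instead of str.split(',')."""
    return next(csv.reader(io.StringIO(first_line)))


def validate(load_labels: Callable[[], Optional[List[str]]]) -> bool:
    """Fail closed: a read error fails the check instead of passing as empty."""
    try:
        labels = load_labels()
    except Exception:
        return False  # unreadable input must not slip past the gate
    if labels is None:
        return True  # label column genuinely absent: benign skip
    return len({value for value in labels if value is not None}) >= 2


# A quoted header defeats the naive split but not the CSV parser.
header_line = 'id,"label"'
naive = header_line.split(",")    # second field keeps its quote characters
parsed = parse_header(header_line)  # quotes are removed by the parser
```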
Comment thread tracebloc_ingestor/validators/label_diversity_validator.py Outdated
Comment thread tracebloc_ingestor/api/client.py
- prepare_dataset clears `last_prepare_error` up front so an early
  `return False` (local mode / invalid category) can't leave a stale
  message that base.py then attaches to an unrelated failure.
- LabelDiversityValidator now accepts the schema so the label column is
  read with the same NA / dtype rules as CSVIngestor / DataValidator.
  Without it, the distinct-label count could disagree with what
  actually ingests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread tracebloc_ingestor/utils/validators_mapping.py Outdated
…bot #252)

Tests for the follow-up fixes in 1539301:
- schema label: NA sentinels ("null"/"NA") read as missing → single-label
  correctly flagged
- non-schema label: "NA"/"null" kept as genuine classes (matches ingestor)
- string-schema label: numeric-looking values ("01","1","1.0") not collapsed
  by numeric inference

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
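
The three cases tested in the commit above come down to one rule, sketched here with hypothetical names. The sentinel set is limited to the two values the tests name; the ingestor's real NA list is not shown in this PR and is an assumption here.

```python
from typing import Iterable, Set

# Only the two sentinels named in the tests above; the full list is assumed
# to live in the ingestor's NA policy.
NA_SENTINELS = frozenset({"null", "NA"})


def distinct_labels(raw_values: Iterable[str], label_in_schema: bool) -> Set[str]:
    """Count classes on raw strings so '01', '1' and '1.0' stay distinct."""
    distinct: Set[str] = set()
    for raw in raw_values:
        if label_in_schema and raw in NA_SENTINELS:
            continue  # schema-declared label: sentinel reads as missing
        distinct.add(raw)  # no numeric inference, so no collapsing
    return distinct
```

With a schema-declared label, `["X", "null", "NA"]` has one class and is flagged; without one, the same column has three classes, matching what the ingestor would store.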

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit f271082.

Comment thread tracebloc_ingestor/validators/label_diversity_validator.py
…d (bugbot #252)

base.py deletes the label/annotation/id columns from file_options["schema"]
(they're framework columns, not table columns), but CSVIngestor reads the
file with NA/dtype rules from the FULL self.schema — so the label column DOES
get NA-sentinel treatment at ingest. The label-diversity validator was wired
with the stripped schema, so _schema_type_for(label) always returned None and
the NA/dtype rules never applied to the label: "null"/"NA"/"" inflated the
distinct count at preflight while ingestion treated them as missing, letting
an effectively single-class dataset slip past the gate and fail at backend
prepare.

base.py now also passes the unstripped schema as `full_schema`; the validator
factory prefers it (falling back to `schema` for direct/test callers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
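
The fallback described in the commit above might look like the following. `pick_label_schema` and the column types are hypothetical; only the `full_schema` and `schema` option names come from the commit message.

```python
from typing import Dict, Optional


def pick_label_schema(options: Dict[str, dict]) -> Optional[dict]:
    """Prefer the unstripped schema; fall back for direct/test callers."""
    full_schema = options.get("full_schema")
    if full_schema is not None:
        return full_schema
    return options.get("schema")


# Illustrative values only: the stripped schema has lost the label column.
stripped = {"feature": "FLOAT"}
unstripped = {"feature": "FLOAT", "label": "VARCHAR(255)"}
```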
…bugbot #252)

CSVIngestor strips column-name whitespace on read (chunk.columns.str.strip()),
so a header like " label " is ingested as `label`. The validator resolved the
label column against the raw header, treated " label " as missing, skipped the
diversity check with a warning, and a single-class CSV passed preflight only
to fail at backend prepare. _resolve_column and _schema_type_for now strip
(and lowercase) on both sides, matching the ingestor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
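
The normalisation in the commit above amounts to comparing both sides after strip and lowercase. A minimal sketch, assuming a plain list of header names; this `resolve_column` is an illustration and not the validator's `_resolve_column`.

```python
from typing import List, Optional


def resolve_column(header: List[str], wanted: str) -> Optional[str]:
    """Return the raw header name matching `wanted`, ignoring case and padding."""
    target = wanted.strip().lower()
    for name in header:
        if name.strip().lower() == target:
            return name  # raw name, so the caller can still select the column
    return None  # genuinely absent: the diversity check is skipped
```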
@divyasinghds divyasinghds merged commit 3f4dbb5 into develop Jun 15, 2026
4 checks passed
@divyasinghds divyasinghds deleted the fix/label-diversity-and-clear-error branch June 15, 2026 09:11
divyasinghds added a commit that referenced this pull request Jun 15, 2026
Release bundles the fixes merged to develop since 0.3.9 (#230–#254):
- Ingestion accounting: dropped records fail the run; JSON read-layer fails
  fast (#230, #234, #235)
- Coercion: single source of truth for NA policy + int64 range (#236, #237)
- Security: block path traversal via manifest filename/mask_id (#239)
- UX papercuts: delimiter hint, NUL truncation, table-name message, Config
  numeric coercion (#238)
- Schema/DB: real MySQL column types, CHAR(N) mapping, drop
  instance_segmentation from enum, empty-CSV fast fail (#240, #241, #249, #250)
- DataValidator: accept single-dict / filter non-dict JSON (#232, #233)
- Single-label classification caught at preflight + friendly backend reason,
  with the full bugbot-hardened LabelDiversityValidator (#251, #252)
- CLI: schema descriptions surfaced in validation errors (#254)
- Reporting: ConsoleRenderer extraction; packaging split (#248, #246)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>