fix(csv): empty CSV must fail fast with clear input-error message by divyasinghds · Pull Request #250 · tracebloc/data-ingestors

divyasinghds · 2026-06-15T07:55:56Z

The cascade

`CSVIngestor.read_data` caught `pd.errors.EmptyDataError` and silently returned an empty generator with only a YELLOW warning. From there:

Ingestor proceeded with an empty record stream
Created an empty MySQL table
Called `send_generate_edge_label_meta` against zero rows
Backend returned HTTP 400: `{"table_name":"No data found for table X and injestor_id Y"}`
PR fix: fail loud when backend dataset registration fails (no silent half-ingest) #187's fail-loud handler surfaced that as: `Backend rejected edge-label metadata; the dataset was NOT registered (its rows are already in the database).`

The user-visible message blames the backend and (worse) falsely tells the user their rows are in the database, when the real cause is an empty input file. Same shape #213 documents for self-supervised + label mismatch — a fault at one layer surfaces with a misleading message several layers down.

Repro

Adversarial test F8 on v0.3.10-rc1: a zero-byte CSV produced:

```
WARNING | csv_ingestor.py:493 | Empty CSV file: /data/shared/divya-adv2/F8_empty.csv
ERROR | api/client.py:393 | Error generating edge label metadata: HTTP 400: {"table_name":"No data found for table name e2etest_adv2_f8_... and injestor_id ..."}
ERROR | base.py:985 | Error during ingestion: Backend rejected edge-label metadata; the dataset was NOT registered (its rows are already in the database).
```

— the empty-file cause is buried in a WARNING that's easy to miss, and the visible ERROR is misdirection.

Fix

Raise `ValueError` at the read layer with a source-truthful message:

```
Empty CSV file: . The file has no header and no rows. Either
stage a non-empty CSV at this path, or check the cluster-side path
(the chart mounts your PVC at /data/shared/ — confirm staging
completed before helm install).
```

DataValidator's existing "No data found to validate" / preflight path catches the ValueError BEFORE any backend round-trip, so the user sees the right reason at the right layer.

Test plan

Updated `test_read_data_empty_file_raises_with_clear_message` (renamed from `..._returns_nothing` to reflect the new contract) asserts the new fail-fast behavior with `pytest.raises(ValueError, match="Empty CSV file")`
Existing parser-error path (ragged rows, etc.) still propagates as before
CI green
Re-run F8 against rc2 → empty-CSV error surfaces at preflight, no backend call

🤖 Generated with Claude Code

Note

Low Risk
Localized CSV read-path behavior change with clearer errors; no auth, security, or broad API surface impact.

Overview
Empty zero-byte CSVs are now treated as hard input errors instead of a successful zero-row read.

CSVIngestor.read_data no longer catches pd.errors.EmptyDataError with a warning and an empty generator. It raises ValueError with a message that names the path and suggests staging / /data/shared/ checks, so users see an input problem at read time rather than a backend HTTP 400 (“No data found for table…”) after an empty table and metadata call.

The test test_read_data_empty_file_raises_with_clear_message replaces the old contract that expected [] and now asserts pytest.raises(ValueError, match="Empty CSV file").

^{Reviewed by Cursor Bugbot for commit 2f37a15. Bugbot is set up for automated code reviews on this repo. Configure here.}

`read_data` caught `pd.errors.EmptyDataError` and silently returned an empty generator (just emitted a YELLOW WARNING). The ingestor then: 1. proceeded with an empty record stream 2. created an empty MySQL table 3. called `send_generate_edge_label_meta` against zero rows 4. backend 400'd with `"No data found for table X and injestor_id Y"` 5. PR #187's fail-loud handler surfaced that as `Backend rejected edge-label metadata; the dataset was NOT registered (its rows are already in the database).` — a message that BLAMES the backend and even falsely tells the user their rows are in the database, when the real cause was an empty input file. Same misleading-cascade shape #213 documents for the self-supervised + label mismatch. Fix: raise ValueError at the read layer with a source-truthful message naming the path and pointing at staging as the most-likely cause (PVC mount, helm-vs-staging timing). DataValidator's existing "No data found to validate" / preflight path catches it BEFORE any backend round-trip, so the user sees the right reason at the right layer. Updates the existing `test_read_data_empty_file_returns_nothing` (which asserted the old silent-empty behavior) to assert the new fail-fast contract instead. Renamed to `test_read_data_empty_file_raises_with_clear_message`. Surfaced by an adversarial test pass against v0.3.10-rc1: an empty CSV ended a chain that surfaced a confusing backend-blaming error several layers removed from the actual fault. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

LukasWodka · 2026-06-15T07:57:01Z

👋 Heads-up — Code review queue is at 14 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#106 — test(integration): pin XGBoost/LightGBM/CatBoost end-to-end averaging round-trips · author: @LukasWodka · no reviewer assigned
averaging-service#111 — fix(averaging): correctness round-2 sweep — N5/N7/N8/N9/N10 + load_model except (Sync develop → master for v0.3.0 release #109) · author: @LukasWodka · no reviewer assigned
client#254 — Sync develop → main for v1.7.1 chart release (egress-enforcement preflight helm test, inert) · author: @saadqbal · no reviewer assigned
client#256 — fix(egress-proxy): probe reports DNS failure distinctly, not as a TCP connect [image_validator falsely rejects images when target_size matches exactly #104] · author: @saadqbal · no reviewer assigned
client-runtime#105 — chore(ci): drop unused mysqlclient dep; report test coverage (Phase 0) · author: @LukasWodka · no reviewer assigned
client-runtime#106 — fix(schema): sync vendored ingest.v1.json — restore token_classification · author: @LukasWodka · no reviewer assigned
data-ingestors#248 — refactor(P2): extract ingestion-summary renderer into reporting.ConsoleRenderer · author: @LukasWodka · no reviewer assigned
data-ingestors#249 — fix(database): CHAR(N) must map to a SQLAlchemy type in _get_sqlalchemy_type · author: @divyasinghds · no reviewer assigned
design-system#47 — feat(table): Table compound organism (Figma 959:6061 + 986:9822) · author: @waqaskhanroghani · no reviewer assigned
tracebloc-py-package#218 — fix(upload): output-shape validation + hard-fail class-count check for text_classification · author: @shujaatTracebloc · reviewer: @aptracebloc

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Release bundles the fixes merged to develop since 0.3.9 (#230–#254): - Ingestion accounting: dropped records fail the run; JSON read-layer fails fast (#230, #234, #235) - Coercion: single source of truth for NA policy + int64 range (#236, #237) - Security: block path traversal via manifest filename/mask_id (#239) - UX papercuts: delimiter hint, NUL truncation, table-name message, Config numeric coercion (#238) - Schema/DB: real MySQL column types, CHAR(N) mapping, drop instance_segmentation from enum, empty-CSV fast fail (#240, #241, #249, #250) - DataValidator: accept single-dict / filter non-dict JSON (#232, #233) - Single-label classification caught at preflight + friendly backend reason, with the full bugbot-hardened LabelDiversityValidator (#251, #252) - CLI: schema descriptions surfaced in validation errors (#254) - Reporting: ConsoleRenderer extraction; packaging split (#248, #246) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

divyasinghds self-assigned this Jun 15, 2026

LukasWodka mentioned this pull request Jun 15, 2026

fix(ingest): catch single-label classification at preflight + surface backend reason (issue #251) #252

Merged

4 tasks

aptracebloc approved these changes Jun 15, 2026

View reviewed changes

divyasinghds merged commit acc57a4 into develop Jun 15, 2026
7 checks passed

divyasinghds deleted the fix/empty-csv-fail-fast branch June 15, 2026 08:27

divyasinghds mentioned this pull request Jun 15, 2026

Release v0.3.10 (ingestion hardening + path-traversal fix + single-label preflight) #259

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(csv): empty CSV must fail fast with clear input-error message#250

fix(csv): empty CSV must fail fast with clear input-error message#250
divyasinghds merged 1 commit into
developfrom
fix/empty-csv-fail-fast

divyasinghds commented Jun 15, 2026 •

edited by cursor Bot

Loading

Uh oh!

LukasWodka commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

divyasinghds commented Jun 15, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The cascade

Repro

Fix

Test plan

Uh oh!

LukasWodka commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

divyasinghds commented Jun 15, 2026 •

edited by cursor Bot

Loading