Skip to content

fix(csv): empty CSV must fail fast with clear input-error message#250

Merged
divyasinghds merged 1 commit into
developfrom
fix/empty-csv-fail-fast
Jun 15, 2026
Merged

fix(csv): empty CSV must fail fast with clear input-error message#250
divyasinghds merged 1 commit into
developfrom
fix/empty-csv-fail-fast

Conversation

@divyasinghds

@divyasinghds divyasinghds commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

The cascade

`CSVIngestor.read_data` caught `pd.errors.EmptyDataError` and silently returned an empty generator with only a YELLOW warning. From there:

  1. Ingestor proceeded with an empty record stream
  2. Created an empty MySQL table
  3. Called `send_generate_edge_label_meta` against zero rows
  4. Backend returned HTTP 400: `{"table_name":"No data found for table X and injestor_id Y"}`
  5. PR fix: fail loud when backend dataset registration fails (no silent half-ingest) #187's fail-loud handler surfaced that as: `Backend rejected edge-label metadata; the dataset was NOT registered (its rows are already in the database).`

The user-visible message blames the backend and (worse) falsely tells the user their rows are in the database, when the real cause is an empty input file. Same shape #213 documents for self-supervised + label mismatch — a fault at one layer surfaces with a misleading message several layers down.

Repro

Adversarial test F8 on v0.3.10-rc1: a zero-byte CSV produced:

```
WARNING | csv_ingestor.py:493 | Empty CSV file: /data/shared/divya-adv2/F8_empty.csv
ERROR | api/client.py:393 | Error generating edge label metadata: HTTP 400: {"table_name":"No data found for table name e2etest_adv2_f8_... and injestor_id ..."}
ERROR | base.py:985 | Error during ingestion: Backend rejected edge-label metadata; the dataset was NOT registered (its rows are already in the database).
```

— the empty-file cause is buried in a WARNING that's easy to miss, and the visible ERROR is misdirection.

Fix

Raise `ValueError` at the read layer with a source-truthful message:

```
Empty CSV file: . The file has no header and no rows. Either
stage a non-empty CSV at this path, or check the cluster-side path
(the chart mounts your PVC at /data/shared/ — confirm staging
completed before helm install).
```

DataValidator's existing "No data found to validate" / preflight path catches the ValueError BEFORE any backend round-trip, so the user sees the right reason at the right layer.

Test plan

  • Updated `test_read_data_empty_file_raises_with_clear_message` (renamed from `..._returns_nothing` to reflect the new contract) asserts the new fail-fast behavior with `pytest.raises(ValueError, match="Empty CSV file")`
  • Existing parser-error path (ragged rows, etc.) still propagates as before
  • CI green
  • Re-run F8 against rc2 → empty-CSV error surfaces at preflight, no backend call

🤖 Generated with Claude Code


Note

Low Risk
Localized CSV read-path behavior change with clearer errors; no auth, security, or broad API surface impact.

Overview
Empty zero-byte CSVs are now treated as hard input errors instead of a successful zero-row read.

CSVIngestor.read_data no longer catches pd.errors.EmptyDataError with a warning and an empty generator. It raises ValueError with a message that names the path and suggests staging / /data/shared/ checks, so users see an input problem at read time rather than a backend HTTP 400 (“No data found for table…”) after an empty table and metadata call.

The test test_read_data_empty_file_raises_with_clear_message replaces the old contract that expected [] and now asserts pytest.raises(ValueError, match="Empty CSV file").

Reviewed by Cursor Bugbot for commit 2f37a15. Bugbot is set up for automated code reviews on this repo. Configure here.

`read_data` caught `pd.errors.EmptyDataError` and silently returned an
empty generator (just emitted a YELLOW WARNING). The ingestor then:
  1. proceeded with an empty record stream
  2. created an empty MySQL table
  3. called `send_generate_edge_label_meta` against zero rows
  4. backend 400'd with `"No data found for table X and injestor_id Y"`
  5. PR #187's fail-loud handler surfaced that as
     `Backend rejected edge-label metadata; the dataset was NOT
     registered (its rows are already in the database).`

— a message that BLAMES the backend and even falsely tells the user
their rows are in the database, when the real cause was an empty
input file. Same misleading-cascade shape #213 documents for the
self-supervised + label mismatch.

Fix: raise ValueError at the read layer with a source-truthful message
naming the path and pointing at staging as the most-likely cause
(PVC mount, helm-vs-staging timing). DataValidator's existing
"No data found to validate" / preflight path catches it BEFORE any
backend round-trip, so the user sees the right reason at the right
layer.

Updates the existing `test_read_data_empty_file_returns_nothing` (which
asserted the old silent-empty behavior) to assert the new fail-fast
contract instead. Renamed to
`test_read_data_empty_file_raises_with_clear_message`.

Surfaced by an adversarial test pass against v0.3.10-rc1: an empty CSV
ended a chain that surfaced a confusing backend-blaming error several
layers removed from the actual fault.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@divyasinghds divyasinghds self-assigned this Jun 15, 2026
@LukasWodka

Copy link
Copy Markdown
Collaborator

👋 Heads-up — Code review queue is at 14 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@divyasinghds divyasinghds merged commit acc57a4 into develop Jun 15, 2026
7 checks passed
@divyasinghds divyasinghds deleted the fix/empty-csv-fail-fast branch June 15, 2026 08:27
divyasinghds added a commit that referenced this pull request Jun 15, 2026
Release bundles the fixes merged to develop since 0.3.9 (#230#254):
- Ingestion accounting: dropped records fail the run; JSON read-layer fails
  fast (#230, #234, #235)
- Coercion: single source of truth for NA policy + int64 range (#236, #237)
- Security: block path traversal via manifest filename/mask_id (#239)
- UX papercuts: delimiter hint, NUL truncation, table-name message, Config
  numeric coercion (#238)
- Schema/DB: real MySQL column types, CHAR(N) mapping, drop
  instance_segmentation from enum, empty-CSV fast fail (#240, #241, #249, #250)
- DataValidator: accept single-dict / filter non-dict JSON (#232, #233)
- Single-label classification caught at preflight + friendly backend reason,
  with the full bugbot-hardened LabelDiversityValidator (#251, #252)
- CLI: schema descriptions surfaced in validation errors (#254)
- Reporting: ConsoleRenderer extraction; packaging split (#248, #246)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants