fix(csv): empty CSV must fail fast with clear input-error message#250
Merged
Conversation
`read_data` caught `pd.errors.EmptyDataError` and silently returned an empty generator (just emitted a YELLOW WARNING). The ingestor then: 1. proceeded with an empty record stream 2. created an empty MySQL table 3. called `send_generate_edge_label_meta` against zero rows 4. backend 400'd with `"No data found for table X and injestor_id Y"` 5. PR #187's fail-loud handler surfaced that as `Backend rejected edge-label metadata; the dataset was NOT registered (its rows are already in the database).` — a message that BLAMES the backend and even falsely tells the user their rows are in the database, when the real cause was an empty input file. Same misleading-cascade shape #213 documents for the self-supervised + label mismatch. Fix: raise ValueError at the read layer with a source-truthful message naming the path and pointing at staging as the most-likely cause (PVC mount, helm-vs-staging timing). DataValidator's existing "No data found to validate" / preflight path catches it BEFORE any backend round-trip, so the user sees the right reason at the right layer. Updates the existing `test_read_data_empty_file_returns_nothing` (which asserted the old silent-empty behavior) to assert the new fail-fast contract instead. Renamed to `test_read_data_empty_file_raises_with_clear_message`. Surfaced by an adversarial test pass against v0.3.10-rc1: an empty CSV ended a chain that surfaced a confusing backend-blaming error several layers removed from the actual fault. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
|
👋 Heads-up — Code review queue is at 14 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
4 tasks
aptracebloc
approved these changes
Jun 15, 2026
4 tasks
divyasinghds
added a commit
that referenced
this pull request
Jun 15, 2026
Release bundles the fixes merged to develop since 0.3.9 (#230–#254): - Ingestion accounting: dropped records fail the run; JSON read-layer fails fast (#230, #234, #235) - Coercion: single source of truth for NA policy + int64 range (#236, #237) - Security: block path traversal via manifest filename/mask_id (#239) - UX papercuts: delimiter hint, NUL truncation, table-name message, Config numeric coercion (#238) - Schema/DB: real MySQL column types, CHAR(N) mapping, drop instance_segmentation from enum, empty-CSV fast fail (#240, #241, #249, #250) - DataValidator: accept single-dict / filter non-dict JSON (#232, #233) - Single-label classification caught at preflight + friendly backend reason, with the full bugbot-hardened LabelDiversityValidator (#251, #252) - CLI: schema descriptions surfaced in validation errors (#254) - Reporting: ConsoleRenderer extraction; packaging split (#248, #246) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The cascade
`CSVIngestor.read_data` caught `pd.errors.EmptyDataError` and silently returned an empty generator with only a YELLOW warning. From there:
The user-visible message blames the backend and (worse) falsely tells the user their rows are in the database, when the real cause is an empty input file. Same shape #213 documents for self-supervised + label mismatch — a fault at one layer surfaces with a misleading message several layers down.
Repro
Adversarial test F8 on v0.3.10-rc1: a zero-byte CSV produced:
```
WARNING | csv_ingestor.py:493 | Empty CSV file: /data/shared/divya-adv2/F8_empty.csv
ERROR | api/client.py:393 | Error generating edge label metadata: HTTP 400: {"table_name":"No data found for table name e2etest_adv2_f8_... and injestor_id ..."}
ERROR | base.py:985 | Error during ingestion: Backend rejected edge-label metadata; the dataset was NOT registered (its rows are already in the database).
```
— the empty-file cause is buried in a WARNING that's easy to miss, and the visible ERROR is misdirection.
Fix
Raise `ValueError` at the read layer with a source-truthful message:
```
Empty CSV file: . The file has no header and no rows. Either
stage a non-empty CSV at this path, or check the cluster-side path
(the chart mounts your PVC at /data/shared/ — confirm staging
completed before helm install).
```
DataValidator's existing "No data found to validate" / preflight path catches the ValueError BEFORE any backend round-trip, so the user sees the right reason at the right layer.
Test plan
🤖 Generated with Claude Code
Note
Low Risk
Localized CSV read-path behavior change with clearer errors; no auth, security, or broad API surface impact.
Overview
Empty zero-byte CSVs are now treated as hard input errors instead of a successful zero-row read.
CSVIngestor.read_datano longer catchespd.errors.EmptyDataErrorwith a warning and an empty generator. It raisesValueErrorwith a message that names the path and suggests staging //data/shared/checks, so users see an input problem at read time rather than a backend HTTP 400 (“No data found for table…”) after an empty table and metadata call.The test
test_read_data_empty_file_raises_with_clear_messagereplaces the old contract that expected[]and now assertspytest.raises(ValueError, match="Empty CSV file").Reviewed by Cursor Bugbot for commit 2f37a15. Bugbot is set up for automated code reviews on this repo. Configure here.