Develop to prod 0.3.10 by divyasinghds · Pull Request #258 · tracebloc/data-ingestors

divyasinghds · 2026-06-15T09:18:23Z

Summary

Type of change

Test plan

Screenshots / recordings

Deployment notes

Checklist

Tests added / updated and passing locally
Docs updated if behavior or config changed
No secrets / credentials in the diff
For security-sensitive paths: appropriate reviewer requested

Note

High Risk
Wide changes to core ingest paths, authentication-adjacent API error handling, and path-traversal security on shared storage, plus many behavior changes that alter exit codes and validation outcomes for customer jobs.

Overview
Release 0.3.10 bundles a large round of ingestion reliability, security, and operator-facing error improvements, plus structural cleanup ahead of a refactor.

Packaging & CI: Runtime deps stay in requirements.txt (explicit numpy, dev tools moved out); requirements-dev.txt pulls runtime + pytest/lint/mypy for local/CI. setup.py filters comments and -r lines from install_requires. E2e and unit workflows install requirements-dev.txt.

Templates & exit contract: All template scripts delegate to shared run_ingestion (exported from the package root), replacing duplicated ingest/exit logic that could swallow errors and exit 0 on hard failures.

Ingestion behavior: Table creation is deferred until validation passes (#260). Dropped/skipped records and invalid intent are fail-fast / counted as failures (#234). JSON ingest raises on bad records instead of silent skips; empty CSV and several CSV edge cases (NUL, wrong delimiter, int overflow, shared NA policy via coercion) are tightened. Label values are stripped before policies (#261). LabelDiversityValidator catches single-class classification at preflight (#251); prepare_dataset stashes backend error text on last_prepare_error. API client registration paths log full HTTP error bodies (not 100-char stubs). CLI schema validation surfaces rule descriptions from ingest.v1.json instead of raw jsonschema mechanics (#254).

Security: _safe_join blocks path traversal on manifest filename / mask_id for SRC/DEST file ops (#239).

Data & schema: get_table_schema maps reflected MySQL dialect types correctly for backend payloads; CHAR(N) and typo hints on unknown DDL types. instance_segmentation removed from schema/conventions; test_category_congruence guards enum ↔ validators ↔ file transfer wiring.

Tests: New e2e characterization harness over bundled modalities (MySQL rows, DEST files, API payloads). Broad new/updated unit tests; ConsoleRenderer extracted for summary output.

^{Reviewed by Cursor Bugbot for commit d2c916c. Bugbot is set up for automated code reviews on this repo. Configure here.}

…rted records (#230) Follow-up to #223, which fixed batch API-send failure accounting but was only verified end-to-end for token_classification. Auditing the other 10 modalities found two real bugs of the same silent-success class plus a diagnostics gap: 1. Five templates (image_classification, tabular_classification, tabular_regression, time_series_forecasting, time_to_event_prediction) logged and SWALLOWED exceptions in main()'s except-Exception handler. Any exception escaping ingest() — validation failure, DB error, or the fail-loud backend-registration RuntimeErrors from base.py — ended with exit code 0 and a K8s Job marked Succeeded. The "No data found for table name" cascade from the original incident would still have exited 0 in these five modalities even after #223, because it surfaces as an exception, not as failed records. Each handler now re-raises (matching the other six templates). 2. _process_batch paired the API send with zip(ids, batch), which pairs positionally and truncates to len(ids). After a MID-batch DB failure (insert_batch's per-record fallback appends successes in scan order) the DB-failed record was sent to the API — a phantom backend record pointing at no MySQL row — while the last inserted record was never sent. Now the send list is the batch minus DB failures, matched by data_id, the same rule _flush_batch already uses for accounting. 3. send_global_meta_meta, send_generate_edge_label_meta, prepare_dataset and create_dataset still had the pre-#223 str(e)[:100] truncation with no response attached — the backend's field error explaining a 400 was cut right after "HTTP 400: ". Mirrored the send_batch fix: response attached to the raised HTTPError, handler logs HTTP <status>: <body> capped at 2,000 chars. tests/test_modality_failure_accounting.py (39 tests, additive only): api-send-failure + clean-run accounting parameterized across ALL 11 template categories (the #223 tests ran category=None, skipping the file-transfer branch), an AST structural check that every template's except-Exception handler re-raises, a mid-batch mis-send regression guard, diagnostics guards for the four client methods, and a guard that create_dataset's raise escapes ingest(). Full suite: 955 passed, 1 xfailed, coverage 97% (gate: 95%). Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…#231) Addresses the review on #230: the ~20-line run-and-report block (run ingest(), log each failed record, sys.exit(1) on failures, re-raise hard errors) was inlined in all 11 template scripts. The hand-maintained copies had already drifted — five swallowed exceptions and exited 0 until #230, and the failed-record log line was wrong in three different ways (wrapper-level .get("filename") logged None for MLM/text; the tabular templates' .get("name") always logged Unknown). The contract now lives once in tracebloc_ingestor/utils/template_runner.py, exported from the package root; each template's main() builds its modality-specific ingestor and calls run_ingestion(ingestor, config.LABEL_FILE, ...). The helper also fixes the failed-record identifier: filename where the category has one, else data_id, else Unknown. The two structural template tests are migrated to pin the new invariant (every template imports + calls run_ingestion; any reintroduced except-Exception/bare-except in main() must re-raise) — their old assertions matched the inlined source pattern this change removes. The behavioral guarantees they pinned move to direct unit tests in tests/test_template_runner.py (SystemExit(1) on failed records, re-raise on hard errors, SystemExit passes the handler untouched, context-manager unwinding, identifier fallback chain). Full suite: 962 passed, 1 xfailed; template_runner.py at 100% coverage, total 97% (gate: 95%). Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…-json fix(validators): accept single-dict JSON top-level shape (issue #232)

…congruence guard (#240) instance_segmentation was a zombie category in the declarative path: the schema enum accepted it and conventions.py resolved defaults for it, but map_validators had no branch (validation trivially passed with zero checks), map_file_transfer had no branch, and the file-transfer gate in BaseIngestor._ingest_with_lock excluded it - so a customer submitting category: instance_segmentation got rows in MySQL, records at the backend API, and a registered dataset with ZERO files staged to DEST_PATH. Training then failed on missing files: the silent-half-ingest pattern from #99. Fail fast instead of half-implement: - Remove instance_segmentation from the schema enum (3 sites), from TaskCategory, and from conventions.py. Submissions now fail at validation, before any DB/network I/O, with the supported list: category: 'instance_segmentation' is not one of ['image_classification', ...] Precedent for narrowing v1 to kill a silent failure: #213 (MLM label rejection). Re-adding when real support lands is purely additive. - Unify base.py's _SRC_PATH_REQUIRED_CATEGORIES and the inline file-transfer gate list into one _FILE_BEARING_CATEGORIES set - their divergence (instance_segmentation sat in the first, not the second) is exactly how this shipped half-wired. - Add tests/test_category_congruence.py: every category the schema enum accepts must resolve a non-empty validator set, file-bearing categories (derived from _data_format_for) must equal the file-transfer gate and have an explicit map_file_transfer branch, and every image category must have file_options defaults. The next category cannot ship half-wired without tripping these. Full implementation of instance_segmentation stays blocked behind the mask-sidecar wiring for the declarative path (#136 xfail in e2e). Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

inspect(engine).get_columns() against a live MySQL returns DIALECT type classes (INTEGER, VARCHAR, FLOAT, DATETIME, DECIMAL, ...), but type_mapping was keyed by GENERIC SQLAlchemy class names (Integer, String, ...). No reflected type ever matched, so every non-VARCHAR column fell through to the "VARCHAR" default — the schema POSTed to the backend via send_global_meta_meta reported INT/FLOAT/DATETIME columns as VARCHAR. (VARCHAR columns only came out right because the fallback happened to be VARCHAR.) Match on the upper-cased class name instead: dialect class names already ARE the MySQL keywords, so the map now only translates the generic names that differ (String->VARCHAR, Integer->INT, Numeric->DECIMAL, ...) and unknown types pass through verbatim instead of being mislabelled VARCHAR. Also: - DECIMAL/NUMERIC carry precision/scale: DECIMAL(10,2) - BOOL columns (reflected by MySQL as TINYINT(1)) report as BOOLEAN - bare String no longer renders as "VARCHAR(None)" The old unit test fed generic types into a mocked inspector, so it could not see the bug; it now pins the dialect-type behaviour, and a new e2e test does declared type -> real CREATE TABLE -> real reflection on MySQL 8 — the shape that would have caught this. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…, #237) (#242) The DataValidator gate, the CSV cast, and the JSON per-record check each decided independently what a declared type permits and what counts as missing, so they drifted and disagreed — a file passed validation and then crashed mid-ingest: - #237: the validator read CSVs with pandas' default NA set (NA/null/None -> missing) but the non-tabular ingestor read with na_values=[""] only, so an "NA" token in a numeric column passed the gate then crashed the cast. - #236: an out-of-int64 value coerced to a finite float at the gate (passed) but at ingest pandas read it as object/string dtype and the overflow guard threw a cryptic numpy "ufunc 'isinf'" / "Integer out of range". New tracebloc_ingestor/utils/coercion.py centralises both decisions and is consumed by all three layers, so the gate and the ingest can't disagree: - build_csv_na_values(schema): per-column NA set applied to every schema column (numeric, date, string alike), used by BOTH the ingestor read and the validator's CSV reads with keep_default_na=False. Non-schema columns (filename, mask_id) are omitted so a file named "NA.jpg" survives. - int_range_error / int_value_overflows: object/string-dtype-safe int64 range check (coerce-then-check, never isinf/'>' on a raw object series). Out-of-range INT/BIGINT values are now rejected with a clear, actionable message ("declare the column as BIGINT") at BOTH preflight and ingest. A 26-digit value in a FLOAT column is correctly accepted (valid float). VARCHAR NA-handling is unchanged (schema string columns still treat NA/NULL as missing — the team's documented convention). Adds tests/test_coercion_consistency.py (cross-layer verdict matrix) and rewrites test_csv_na_values.py for the unified policy. Closes #236, #237. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…fast (#234, #235) (#243) Records dropped by the skip paths never reached failed_records, so run_ingestion exited 0 and the K8s Job was marked Succeeded even when rows were silently lost (#234). And JSON's read layer silently skipped a bad record while CSV aborts, so the two formats behaved inconsistently (#235). #234 — close the silent-skip channels: - A record dropped by process_record (blank/invalid unique_id, or a processing error) is now appended to failed_records, so the run exits non-zero instead of reporting clean success. - IngestionSummary.has_failures and the summary's total_failures now include skipped_records — fixing the contradiction where a run printed "0 failure(s)" while the success-rate text said "most records failed". - An invalid intent now fails fast (raises before any lock/DB work). It's a run-wide config error, not a per-row skip; previously it silently dropped every record and still exited 0. #235 — make CSV and JSON consistent: - Through the real ingest() path BOTH formats already fail-fast at the DataValidator gate on a bad cell; only JSON's read-layer diverged by silently skipping records the gate missed. _iter_validated_records now raises (fail-fast) like the CSV cast and the gate, so a bad record aborts with a clear non-zero exit instead of vanishing. Per-row tolerance (keep good rows, drop+count bad ones — the issue's "preferred direction") is intentionally deferred: it requires the DataValidator gate to stop blocking on per-cell errors, a change to the gate's role best done with the validator/registry refactor rather than bundled into this correctness fix. Closes #234. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…) (#244) The per-row `filename` and `mask_id` come straight from the user's CSV/JSON manifest and flowed unsanitised into os.path.join on both the read (SRC_PATH) and write (DEST_PATH) sides of file_transfer. A crafted value escaped both sandboxes: - An ABSOLUTE value drops the sandbox prefix entirely (os.path.join("/data/shared","images","/etc/passwd") == "/etc/passwd"). - A "../../" value walks up out of the sandbox. On the shared cluster PVC (multi-tenant / federated deployments) this is a read primitive (copy another tenant's file or a pod-mounted secret INTO the dataset, whose metadata then syncs to the platform) and an arbitrary-write primitive (land the copy outside DEST_PATH, e.g. /etc/cron.d/evil). Affects every file-bearing category. Add _safe_join(root, *parts): normalise the join with os.path.abspath and reject anything resolving outside `root`. Wire it at all seven user-influenced join sites (_find_src; the image/annotation/text/mask DEST writes; the text SRC read; _find_mask_src). A traversal/absolute value now raises a clear ValueError (caught per-record by the ingest loop, so the record is rejected and counted as a failure and the run exits non-zero); a legitimately-missing sidecar still returns None (tolerated skip). abspath (not realpath) is used so legitimately symlinked PVC mounts keep working; the threat is manifest-driven ".."/absolute injection, not attacker-controlled symlinks on the user's own staged data. tests/test_file_transfer_path_traversal.py demonstrates each read/write primitive is blocked and the legitimate basename path is unchanged. Closes #239. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

Grouped low-severity fixes found during ingestor stress-testing; each small and independent. 1. Wrong delimiter -> misleading error. A ;/tab/pipe-delimited file (European Excel export, or a TSV saved as .csv) parses as a single column and the error was "Schema columns not present in CSV: <every column>". When the file parsed as one column whose header still holds the delimiter, CSVIngestor._validate_csv now hints at the real delimiter. 2. NUL byte silently truncates a field. pandas' C parser truncates "a\x00b" to "a" at read time — silent corruption. The NUL is valid UTF-8 so it slipped past the encoding preflight; _check_csv_encoding now also rejects NUL bytes up front with a clear message (single raw-byte scan, runs before read_data). 3. TableNameValidator message/impl mismatch. The compiled regex forbade a leading underscore while the error message AND valid_pattern metadata both said it was allowed, so "_tmp" was rejected with a message claiming it was valid. MySQL permits a leading underscore; the regex is relaxed to ^[a-zA-Z_][a-zA-Z0-9_]*$ to agree with the (correct) message/metadata. 4. Config numeric env coercion. A non-numeric MYSQL_PORT / BATCH_SIZE raised a raw "invalid literal for int()" at property access; Config now raises a clear "<FIELD> must be an integer" naming the env var to fix. Closes #238. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…#246) `pip install tracebloc_ingestor` and the cosign-signed Docker image both pulled in pytest, black, flake8, mypy, isort and twine, because setup.py fed the entire requirements.txt (which mixed runtime + dev tooling) into install_requires — and even fed the section-comment lines verbatim. - requirements.txt is now runtime-only and declares numpy explicitly (it's imported directly by csv_ingestor / data_validator; was relied on only transitively via pandas). - requirements-dev.txt (new) carries the test / lint / type / release tooling and `-r requirements.txt`. - setup.py filters blank / comment / -r lines when building install_requires. - tests.yml / e2e.yml install requirements-dev.txt. Verified: the built wheel's Requires-Dist is exactly the 13 runtime deps, no dev tooling. The signed image (Dockerfile installs requirements.txt) and PyPI consumers no longer ship test/lint tooling. Publish workflows are unaffected (they install setuptools/wheel/twine directly, not via requirements.txt). Full suite green (962 passed, 97% coverage). First step of the data-ingestors refactor (packaging hygiene); the structural phases follow once the in-flight fix PRs merge. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

… modality (#247) Adds e2e/test_characterization.py — the safety net for the upcoming structural refactor. For each bundled template dataset it runs the real engine into real MySQL and pins three observable dimensions the refactor must preserve: 1. MySQL rows — count == source manifest; standard-column semantics (data_intent, single non-null ingestor_id, unique non-null data_id); feature-value round-trip for the tabular family (catches type corruption). 2. DEST_PATH file manifest — the sidecar files copied for file-bearing categories (catches the insert-rows-but-copy-no-files silent-half-ingest class). 3. Backend payloads — the records + metadata handed to the APIClient, captured by spying on the APIClient method args (CLIENT_ENV=local short-circuits the HTTP call before the payload is serialised). Covers all 9 cleanly-ingesting modalities. Expectations are DERIVED from the source files (row counts, sidecar listings, schema) — no hardcoded magic values — so the harness stays honest if a template changes. It characterises CLEAN-input behavior, which the in-flight fix PRs (#242-#245) don't change (they only touch malformed-input handling), so the goldens hold on develop today and become the contract the refactor is checked against. Building it already surfaced two behaviors worth pinning: MySQL FLOAT is 32-bit (so a ~1e-6 round-trip delta is correct, not corruption), and the label column is mapped onto the standard `label` column (so it's absent from the physical feature schema). Runs in the e2e job (real MySQL); auto-skipped in the unit suite when no MySQL is reachable (existing e2e/conftest mechanism). Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…my_type (#249) `DataValidator._validate_char` accepts CHAR(N) — the validator vocabulary includes CHAR, with explicit length enforcement and NULL tolerance just like VARCHAR. But `_get_sqlalchemy_type` in the DDL layer had no CHAR entry, so a user schema like `code: CHAR(1)` passed every preflight validator and then died at table creation with: ValueError: Unsupported MySQL type: CHAR(1) — the two layers had divergent type vocabularies, exactly the class of mismatch #192 already fixed for NUMERIC. Add CHAR to `type_mapping` (imported from sqlalchemy core), so: - `CHAR(N)` → CHAR(length=N) — fixed-width, padded per MySQL's CHAR semantics - bare `CHAR` → CHAR class, dialect picks the default length (1) CHAR is functionally close to VARCHAR but semantically distinct (MySQL right-pads CHAR to its declared length on storage), so using the SQLAlchemy `CHAR` type rather than aliasing to `String` preserves the declared intent. Adds regression tests: - CHAR(1) → CHAR instance with length == 1 - bare CHAR → CHAR (class or instance, dialect-defaulted) Surfaced by an adversarial all-types tabular schema run against v0.3.10-rc1: a CSV column declared `CHAR(1)` failed table creation despite every type in the schema being documented as supported. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`read_data` caught `pd.errors.EmptyDataError` and silently returned an empty generator (just emitted a YELLOW WARNING). The ingestor then: 1. proceeded with an empty record stream 2. created an empty MySQL table 3. called `send_generate_edge_label_meta` against zero rows 4. backend 400'd with `"No data found for table X and injestor_id Y"` 5. PR #187's fail-loud handler surfaced that as `Backend rejected edge-label metadata; the dataset was NOT registered (its rows are already in the database).` — a message that BLAMES the backend and even falsely tells the user their rows are in the database, when the real cause was an empty input file. Same misleading-cascade shape #213 documents for the self-supervised + label mismatch. Fix: raise ValueError at the read layer with a source-truthful message naming the path and pointing at staging as the most-likely cause (PVC mount, helm-vs-staging timing). DataValidator's existing "No data found to validate" / preflight path catches it BEFORE any backend round-trip, so the user sees the right reason at the right layer. Updates the existing `test_read_data_empty_file_returns_nothing` (which asserted the old silent-empty behavior) to assert the new fail-fast contract instead. Renamed to `test_read_data_empty_file_raises_with_clear_message`. Surfaced by an adversarial test pass against v0.3.10-rc1: an empty CSV ended a chain that surfaced a confusing backend-blaming error several layers removed from the actual fault. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…onsoleRenderer (#248) Structural refactor phase P2 (backend#796) — pull presentation out of the ingestion logic. The summary box (success rate, per-channel counts, status banner) was ~114 lines of print + ANSI + emoji living inside the 1,196-line BaseIngestor god class. - New tracebloc_ingestor/reporting.py with ConsoleRenderer.render_summary — the single home for that presentation. The body moved VERBATIM, so the customer-facing output is byte-for-byte identical. - BaseIngestor._log_summary becomes a thin delegate (kept as a method so the existing callers/tests that reference BaseIngestor._log_summary keep working). base.py: -119/+10 lines. - Drops the now-unused BLUE import from base.py. - IngestionSummary stays the pure data object the renderer consumes; the renderer type-hints it via TYPE_CHECKING to avoid a runtime import cycle. Payoff: the presentation is now unit-testable without a DB or a full ingest — tests/test_reporting.py (8 cases, reporting.py 100% covered) locks the output contract, including #234's "dropped records count toward the failure total / disqualify the success banner" behavior. Behaviour-preserving: full unit suite 1026 passed (97.3% coverage); the characterization harness goldens (#247: DB cells / DEST_PATH manifest / backend payloads) are unaffected — they don't assert the summary box — and the CI e2e job re-confirms against real MySQL. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

… backend reason (issue #251) (#252) * fix(ingest): catch single-label classification at preflight + surface backend reason (issue #251) Two coupled improvements to the user experience when a classification dataset has fewer than 2 distinct label values — surfaced by adversarial testing against v0.3.10-rc1. ## The cascade we hit Submit a `tabular_classification` (or any classification-family) dataset where the `label` column has only ONE distinct value across all rows (e.g. a 10-row test fixture all labelled `"X"`): 1. Validators pass — no per-record issue, just label uniformity. 2. All rows insert into MySQL successfully. 3. `prepare_dataset` calls `/global_meta/prepare/`. 4. Backend returns HTTP 400: `{"message":"Please provide atleast 2 labels."}` 5. The framework raises a generic `"Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above."` The user-visible error says "see the logged API error above" — but the ACTUAL reason ("Please provide atleast 2 labels") is one ERROR layer up and easy to miss. Rows are orphaned in MySQL by the time anything surfaces. ## Fix 1 — preflight validator (the right layer) New `LabelDiversityValidator` in `tracebloc_ingestor/validators/label_diversity_validator.py`: - Reads ONLY the label column when given a CSV path (usecols=[...] — a multi-GB proteomics panel doesn't get materialised to count labels). - Counts non-null distinct values; rejects when < min_distinct (default 2 — matching the backend's exact contract). - Error message NAMES the offending distinct value(s) and the value-count breakdown so the user immediately sees what's wrong with the input ("1 distinct value(s): ['X']. Value counts: {'X': 10}"). - Suggests the regression-family categories as the right home for a single-target dataset, so users with continuous targets get a pointer instead of being told to fake a second class. - Gracefully no-ops on empty input or a missing label column — other validators already surface those clearly; don't double-report. Wired in via `validators_mapping.py` for every classification-family category: tabular_classification, image_classification, text_classification, object_detection, semantic_segmentation, and keypoint_detection. NOT wired for regression (tabular_regression / time_series_forecasting / time_to_event_prediction) — continuous targets legitimately have "1 distinct value" semantics — nor for masked_language_modeling / token_classification (self-supervised or per-token labels). ## Fix 2 — propagate backend body into the user-visible error `APIClient.prepare_dataset`'s error handler now stashes the formatted backend response (`HTTP <status>: <body>`) on `self.last_prepare_error`. `base.py`'s "Backend failed to prepare" RuntimeError reads that and includes the actual backend reason verbatim in the user-visible message. Before: Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). See the logged API error above. After: Backend failed to prepare the dataset; it was NOT registered (its rows are already in the database). Backend response: HTTP 400: {"message":"Please provide atleast 2 labels."} A user piping the log through filtered output (CI, error tooling, paste- into-Slack) sees the cause in the visible line — no more grep required. ## Tests `tests/test_label_diversity_validator.py` (14 cases): - Positive: 2+ labels, many labels, nulls don't count, case-insensitive column match - Failure: single label (with value-counts in message), all-null, one-value-plus-nulls - Error mentions the regression alternative for single-target users - Defensive: empty input / missing column → silent pass (other validators own those errors) - CSV-path: reads only the label column (no full-table OOM risk), rejects single-label CSV at the file level - Custom label column name Extended `tests/test_api_client_methods.py`: - On HTTP error, `last_prepare_error` populated with status + body - On a clean ingestor, `last_prepare_error is None` - On network error (no response), still populated with the stringified exception (Backend-side typo "atleast" → "at least" lives in the backend repo — flagged in #251 for whoever maintains that endpoint.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(validators): label-diversity loader fails closed + uses pandas header parse (bugbot #252) Two bugbot findings on the label-diversity preflight (#251): - High: _load_data swallowed read errors and returned None, which validate treats as empty → valid. A single-label CSV whose read failed could skip the gate and hit the backend rejection. Read errors now propagate to validate's handler and fail the check (fail-closed). - Medium: header detection used a naive header_line.split(",") that diverges from pandas on quoted/alternate-delimiter headers. On a miss it fell back to nrows=1 and counted distinct labels on one row, rejecting a diverse dataset. Now resolves the column against pandas' own header parse (nrows=0); a genuinely-absent column returns None (benign skip). Also update test_validators_mapping for the LabelDiversityValidator now in the classification chains (was a stale pre-existing failure on this branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(label-diversity): address Cursor Bugbot follow-up on PR #252 - prepare_dataset clears `last_prepare_error` up front so an early `return False` (local mode / invalid category) can't leave a stale message that base.py then attaches to an unrelated failure. - LabelDiversityValidator now accepts the schema so the label column is read with the same NA / dtype rules as CSVIngestor / DataValidator. Without it, the distinct-label count could disagree with what actually ingests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(label-diversity): cover NA-alignment with ingest read rules (bugbot #252) Tests for the follow-up fixes in 1539301: - schema label: NA sentinels ("null"/"NA") read as missing → single-label correctly flagged - non-schema label: "NA"/"null" kept as genuine classes (matches ingestor) - string-schema label: numeric-looking values ("01","1","1.0") not collapsed by numeric inference Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(label-diversity): use full schema for label NA rules, not stripped (bugbot #252) base.py deletes the label/annotation/id columns from file_options["schema"] (they're framework columns, not table columns), but CSVIngestor reads the file with NA/dtype rules from the FULL self.schema — so the label column DOES get NA-sentinel treatment at ingest. The label-diversity validator was wired with the stripped schema, so _schema_type_for(label) always returned None and the NA/dtype rules never applied to the label: "null"/"NA"/"" inflated the distinct count at preflight while ingestion treated them as missing, letting an effectively single-class dataset slip past the gate and fail at backend prepare. base.py now also passes the unstripped schema as `full_schema`; the validator factory prefers it (falling back to `schema` for direct/test callers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(label-diversity): resolve label column whitespace-insensitively (bugbot #252) CSVIngestor strips column-name whitespace on read (chunk.columns.str.strip()), so a header like " label " is ingested as `label`. The validator resolved the label column against the raw header, treated " label " as missing, skipped the diversity check with a warning, and a single-class CSV passed preflight only to fail at backend prepare. _resolve_column and _schema_type_for now strip (and lowercase) on both sides, matching the ingestor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… raw JSON-schema mechanic (#254) * fix(cli): surface schema descriptions in validation errors instead of raw mechanic The ingest.yaml schema (``schema/ingest.v1.json``) carries beautifully written ``description`` fields explaining each ``allOf`` rule's intent, rationale, and (where applicable) a fix hint — e.g. for the self-supervised + label rule: "Self-supervised categories MUST NOT set `label`. The shipped CSV has no label column, and the framework registers no edge-label metadata for them. Setting `label` anyway used to ingest rows successfully, then crash at backend registration with a misleading HTTP 400 'No data found' (issue #213). Reject at submission instead." But ``_format_errors`` only emitted the raw ``ValidationError.message``, which for a ``not`` / ``oneOf`` / ``allOf`` rule reads: <root>: {'apiVersion': 'tracebloc.io/v1', ..., 'label': 'label'} should not be valid under {'required': ['label']} — technically correct, but it dumps the entire submission as a Python dict and uses JSON-schema vocabulary the customer didn't write. A user who copy-pasted ``label: label`` from another category's YAML can't tell from the mechanic that MLM is self-supervised, or how to fix it. Surfaced by adversarial testing: when the user-visible error fires for this rule, the schema's helpful description never reaches the user. ## Fix New helper ``_describe_from_schema_path(schema, absolute_schema_path)`` walks the schema from root toward the failing rule, capturing the deepest ``description`` field encountered along the way (descriptions are inherited by sub-trees per JSON-schema convention; the deepest one is the most specific). ``_format_errors`` uses the helper when a description is available: <root>: Self-supervised categories MUST NOT set `label`. The shipped CSV has no label column, and the framework registers no edge-label metadata for them. Setting `label` anyway used to ingest rows successfully, then crash at backend registration with a misleading HTTP 400 'No data found' (issue #213). Reject at submission instead. (rule: not={'required': ['label']}) The mechanic still appears on a follow-up line for power-user debugging without burying it in the headline. For primitive checks that already have decent messages (``'label' is a required property``, enum violations), we keep ``e.message`` unchanged — descriptions only attach when one exists in the schema. ## Test New ``test_schema_violation_surfaces_description_not_raw_mechanic`` in ``tests/test_cli_run.py``: - Submits MLM with ``label: label`` - Asserts the user-visible error contains the schema's description keywords ("Self-supervised", "MUST NOT set `label`") - Asserts the rule mechanic still appears on a follow-up line - Asserts the old behavior (full YAML dict dumped in the headline) is NOT present - Asserts no DB / network I/O happens before validation rejects Existing ``test_schema_violation_fails_fast`` (missing required fields) continues to pass — primitive ``"X is a required property"`` messages don't have descriptions in the schema and fall through to the original ``e.message`` path unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cli): description walker must skip the schema's root description (bugbot #254) Cursor Bugbot caught a real bug in the previous commit: the walker captured the root schema's ``description`` on its first iteration — before descending — and the root description in ``ingest.v1.json`` is a generic blurb about the schema as a whole ("Declarative configuration for the tracebloc data ingestor..."). The result: EVERY validation error returned a non-None description, so ``_format_errors`` replaced jsonschema's clear primitive messages ("'label' is a required property") with that generic prose plus a confusing ``(rule: required=...)`` line — contradicting the very fallback path the previous commit was supposed to preserve. Fix: walk step-by-step, only capturing descriptions on nodes we DESCEND INTO. The root is never captured. ``allOf`` branches, sub-property definitions, and other inner nodes are — those are the descriptions authored to explain specific rules. ## Verifying Two tests pin the fix: 1. Updated ``test_schema_violation_fails_fast`` (the missing-required- field case) now asserts: - ``"is a required property"`` IS in the output (the primitive message survives) - ``"Declarative configuration"`` is NOT in the output (root description does not leak) 2. New unit test ``test_describe_from_schema_path_skips_root_description`` on the walker itself: - Primitive ``["required"]`` path → returns None (no inner desc) - Generic property-type path → returns None - ``allOf`` branch path → returns the rule-specific description The MLM+label end-to-end test (``test_schema_violation_surfaces_description_not_raw_mechanic``) keeps passing — the rule-specific description is still surfaced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

LukasWodka · 2026-06-15T09:19:33Z

👋 Heads-up — Code review queue is at 15 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#106 — test(integration): pin XGBoost/LightGBM/CatBoost end-to-end averaging round-trips · author: @LukasWodka · no reviewer assigned
averaging-service#111 — fix(averaging): correctness round-2 sweep — N5/N7/N8/N9/N10 + load_model except (Sync develop → master for v0.3.0 release #109) · author: @LukasWodka · no reviewer assigned
client#254 — Sync develop → main for v1.7.1 chart release (egress-enforcement preflight helm test, inert) · author: @saadqbal · no reviewer assigned
client-runtime#105 — chore(ci): drop unused mysqlclient dep; report test coverage (Phase 0) · author: @LukasWodka · no reviewer assigned
client-runtime#106 — fix(schema): sync vendored ingest.v1.json — restore token_classification · author: @LukasWodka · no reviewer assigned
data-ingestors#253 — refactor(P3a): move per-category behavior flags into a ModalityRegistry · author: @LukasWodka · no reviewer assigned
data-ingestors#255 — refactor(P3b): move validator factories into the ModalityRegistry · author: @LukasWodka · no reviewer assigned
data-ingestors#256 — refactor(P3c): move sidecar-transfer factories into the ModalityRegistry · author: @LukasWodka · no reviewer assigned
data-ingestors#257 — refactor(P3d): source data_format from the ModalityRegistry · author: @LukasWodka · no reviewer assigned
design-system#47 — feat(table): Table compound organism (Figma 959:6061 + 986:9822) · author: @waqaskhanroghani · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Release bundles the fixes merged to develop since 0.3.9 (#230–#254): - Ingestion accounting: dropped records fail the run; JSON read-layer fails fast (#230, #234, #235) - Coercion: single source of truth for NA policy + int64 range (#236, #237) - Security: block path traversal via manifest filename/mask_id (#239) - UX papercuts: delimiter hint, NUL truncation, table-name message, Config numeric coercion (#238) - Schema/DB: real MySQL column types, CHAR(N) mapping, drop instance_segmentation from enum, empty-CSV fast fail (#240, #241, #249, #250) - DataValidator: accept single-dict / filter non-dict JSON (#232, #233) - Single-label classification caught at preflight + friendly backend reason, with the full bugbot-hardened LabelDiversityValidator (#251, #252) - CLI: schema descriptions surfaced in validation errors (#254) - Reporting: ConsoleRenderer extraction; packaging split (#248, #246) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s duplication (issue #261) (#262) A new-user CSV with ``" A "`` (whitespace-padded) mixed with ``"A"`` landed in MySQL as TWO distinct classes — the ingestor reported ``🎉 Ingestion completed successfully!``, the LabelDiversityValidator happily passed (2 distinct values ≥ 2), and a model trained on the output learned 3 classes instead of the 2 the user intended. Silent label-set corruption. The framework already strips column HEADERS (``chunk.columns.str.strip()`` in csv_ingestor) and the ``data_id`` column (line above in process_record). Extending the same convention to label-column VALUES is internally consistent — a label is a class identifier, and class identifiers don't carry whitespace semantics. (Other VARCHAR columns — free-text descriptions, names, etc. — are left alone; only the label column gets this treatment.) ## Two changes ### 1. Write-side strip — ``BaseIngestor.process_record`` Strip ``label_val.strip()`` for string labels before ``label_policy.apply`` runs. Numeric labels (INT class IDs) and other non-strings pass through unchanged. ### 2. Preflight WARNING — ``LabelDiversityValidator`` Collapse whitespace-equivalent values BEFORE counting distinct, AND surface a WARNING listing the pre-strip variants when collapsing changed anything. The validator and the write side now use the same distinct count, so a dataset with ``[" A ", "A", "B"]``: - Validator: distinct=2 (A, B), warning lists ``{"A": [" A ", "A"]}`` - Ingest: stores ``A`` and ``B`` — 2 classes in MySQL, as intended A dataset where ALL variants collapse to a single class (``[" A ", "A", " A"]``) still fails diversity: - Validator: distinct=1, rejects with a message that explicitly mentions whitespace stripping so the user knows what the validator did ## Tests ``test_label_diversity_validator.py`` (2 new): - ``test_whitespace_duplicates_collapsed_and_warned`` — mixed dataset passes; warning names the variants - ``test_whitespace_only_single_value_still_fails`` — all-collapse case rejects with a clear message ``test_ingestor_base.py`` (3 new): - ``test_process_record_strips_whitespace_from_string_label`` — ``" A "`` → ``"A"`` - ``test_process_record_label_strip_makes_whitespace_variants_equivalent`` — two records with whitespace variants produce identical cleaned labels - ``test_process_record_label_strip_preserves_non_string_labels`` — INT 42 passes through unchanged ## Surfaced by Adversarial new-user test pass against v0.3.10-rc2 (scenario N15): shipped CSV had ``" A "`` from an Excel paste, MySQL ended up with 3 distinct classes ``{" A ", "A", "B"}`` and the ingestor reported success. ## Related - #251: LabelDiversityValidator caught the inverse case (1 distinct). This fixes the silent-but-more-distinct sibling: validator accepts because the strings are bytewise distinct, but the user has fewer semantic classes than that. - The whitespace strip on column HEADERS in csv_ingestor (already in develop) establishes the convention; this extends it to label values. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Newline-delimited JSON (JSONL) is a common export format from log / event pipelines (Datadog, Splunk, Kafka, ...) — one JSON object per line, no top-level array. ``json.load`` rejects JSONL with the opaque ``json.JSONDecodeError: Extra data`` (Python 3.12+) or ``Trailing data`` (earlier 3.x), which our preflight surfaced verbatim: Label-diversity validation error: Trailing data A new user looking at that error has no idea (a) that their file is JSONL, (b) that JSONL isn't a supported input format, or (c) how to fix it. ## Fix In ``DataValidator._load_data``'s .json branch, wrap ``json.load`` in a try/except. On ``JSONDecodeError`` whose message says "Extra data" or "Trailing data", run a JSONL probe: if the file's first non-empty line itself parses as JSON, we have JSONL → raise a clear ValueError naming the pattern and the fix: <path> looks like newline-delimited JSON (JSONL) — one JSON object per line. JSONL isn't supported as an input format; the ingestor expects a top-level JSON array of records or a single object. Combine the records into an array (wrap with ``[...]`` and join with ``,``) and re-ingest. The detection is conservative — both the original error class AND the first-line-parses check must hold. A genuinely malformed JSON file (e.g. ``{trailing garbage``) doesn't trigger the JSONL message because its first line doesn't parse standalone. Also adjust ``_load_data``'s outer ``except Exception``: let ``ValueError`` propagate (so this fix message reaches the validator's result), but keep swallowing other I/O exceptions as before. ## Tests - ``test_loads_from_json_jsonl_surfaces_clear_error``: JSONL input → error mentions JSONL + the fix (combine into array) - ``test_loads_from_json_malformed_not_misidentified_as_jsonl``: genuinely broken JSON falls through to the generic error path ## Surfaced by Adversarial new-user test pass against v0.3.10-rc2, scenario N12. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A new user typing ``BIGINTEGER`` (mixing MySQL's BIGINT with Python's ``int``), ``BOLEAN`` (missing an O), ``NUMRIC`` (missing an E), etc. used to get a bare ``Unsupported MySQL type: <typo>`` error with no hint about what's actually supported. They'd have to spelunk the schema docs or guess. ## Fix Add a Levenshtein-based "Did you mean X?" hint to ``_get_sqlalchemy_type``'s ValueError. The threshold (d ≤ 3) is the empirical sweet spot: - close enough to catch every realistic typo we've seen in adversarial testing (single-letter swap, missing letter, repeated letter, common prefix/suffix confusion) - wide enough to fail-silently on genuine unrelated types (a real ``GEOMETRY`` request stays "Unsupported" with no misleading hint) The error ALSO prints the full supported-type list so: - the suggestion gets the user 80% of the way there - the list catches the cases where the closest-by-distance suggestion isn't the closest-by-intent (e.g. BIGINTEGER's nearest by edits is INTEGER (d=3), even though the user clearly wanted BIGINT (d=4) — the printed list has BIGINT right next to INTEGER) ## Tests - ``test_get_sqlalchemy_type_typo_suggests_correction``: BIGINTEGER surfaces the INTEGER suggestion + the BIGINT alternative is in the supported list - ``test_get_sqlalchemy_type_typo_no_suggestion_for_distant_input``: GEOMETRY (d=6 to nearest) gets no hint, only the supported list - Parametrised cases over realistic typos: INTGER→INTEGER, NUMRIC→NUMERIC, BOLEAN→BOOLEAN, VARCAHR→VARCHAR ## Surfaced by Adversarial new-user test pass against v0.3.10-rc2, scenario N3: ``BIGINTEGER`` got a bare unhelpful error. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…265) A validator-rejected ingest used to leave an empty orphaned table behind: `create_table` fired inside `__init__`, BEFORE `validate_data()` ran. On the next retry the stale-table guard tripped and the user had to manually DROP before re-running. Move the `create_table` call out of `__init__` into `_ingest_with_lock`, right after `validate_data()` accepts the input. The cleaned schema is stashed on the ingestor at construction time and consumed on first ingest. `self.table` starts as `None` and is only populated once the table actually exists. Regression tests: - Construction alone does not call `create_table` and leaves `self.table` as `None`. - A validator-rejected ingest never calls `create_table`, so no orphan can remain for the next run to trip on. - The happy path still creates the table exactly once, with the cleaned schema. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shujaatTracebloc and others added 16 commits June 12, 2026 09:08

Merge pull request #233 from tracebloc/fix/data-validator-single-dict…

4171fb0

…-json fix(validators): accept single-dict JSON top-level shape (issue #232)

LukasWodka mentioned this pull request Jun 15, 2026

Release v0.3.10 (ingestion hardening + path-traversal fix + single-label preflight) #259

Merged

4 tasks

divyasinghds changed the title ~~Develop to prod~~ Develop to prod 0.3.10 Jun 15, 2026

aptracebloc previously approved these changes Jun 15, 2026

View reviewed changes

saadqbal previously approved these changes Jun 15, 2026

View reviewed changes

divyasinghds dismissed stale reviews from saadqbal and aptracebloc via cab8514 June 15, 2026 11:09

divyasinghds and others added 3 commits June 15, 2026 16:40

aptracebloc approved these changes Jun 15, 2026

View reviewed changes

divyasinghds merged commit c2c66ed into master Jun 15, 2026
11 of 12 checks passed

saadqbal mentioned this pull request Jun 15, 2026

Release v0.3.10 (ingestion hardening + path-traversal fix + single-label preflight) #268

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Develop to prod 0.3.10#258

Develop to prod 0.3.10#258
divyasinghds merged 21 commits into
masterfrom
develop

divyasinghds commented Jun 15, 2026 •

edited by cursor Bot

Loading

Uh oh!

LukasWodka commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

divyasinghds commented Jun 15, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related

Type of change

Test plan

Screenshots / recordings

Deployment notes

Checklist

Uh oh!

LukasWodka commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

divyasinghds commented Jun 15, 2026 •

edited by cursor Bot

Loading