Skip to content

Develop to prod 0.3.10#258

Merged
divyasinghds merged 21 commits into
masterfrom
develop
Jun 15, 2026
Merged

Develop to prod 0.3.10#258
divyasinghds merged 21 commits into
masterfrom
develop

Conversation

@divyasinghds

@divyasinghds divyasinghds commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Related

Type of change

  • Feature
  • Bug fix
  • Tech-debt / refactor
  • Docs
  • Security / hardening
  • Breaking change

Test plan

Screenshots / recordings

Deployment notes

Checklist

  • Tests added / updated and passing locally
  • Docs updated if behavior or config changed
  • No secrets / credentials in the diff
  • For security-sensitive paths: appropriate reviewer requested

Note

High Risk
Wide changes to core ingest paths, authentication-adjacent API error handling, and path-traversal security on shared storage, plus many behavior changes that alter exit codes and validation outcomes for customer jobs.

Overview
Release 0.3.10 bundles a large round of ingestion reliability, security, and operator-facing error improvements, plus structural cleanup ahead of a refactor.

Packaging & CI: Runtime deps stay in requirements.txt (explicit numpy, dev tools moved out); requirements-dev.txt pulls runtime + pytest/lint/mypy for local/CI. setup.py filters comments and -r lines from install_requires. E2e and unit workflows install requirements-dev.txt.

Templates & exit contract: All template scripts delegate to shared run_ingestion (exported from the package root), replacing duplicated ingest/exit logic that could swallow errors and exit 0 on hard failures.

Ingestion behavior: Table creation is deferred until validation passes (#260). Dropped/skipped records and invalid intent are fail-fast / counted as failures (#234). JSON ingest raises on bad records instead of silent skips; empty CSV and several CSV edge cases (NUL, wrong delimiter, int overflow, shared NA policy via coercion) are tightened. Label values are stripped before policies (#261). LabelDiversityValidator catches single-class classification at preflight (#251); prepare_dataset stashes backend error text on last_prepare_error. API client registration paths log full HTTP error bodies (not 100-char stubs). CLI schema validation surfaces rule descriptions from ingest.v1.json instead of raw jsonschema mechanics (#254).

Security: _safe_join blocks path traversal on manifest filename / mask_id for SRC/DEST file ops (#239).

Data & schema: get_table_schema maps reflected MySQL dialect types correctly for backend payloads; CHAR(N) and typo hints on unknown DDL types. instance_segmentation removed from schema/conventions; test_category_congruence guards enum ↔ validators ↔ file transfer wiring.

Tests: New e2e characterization harness over bundled modalities (MySQL rows, DEST files, API payloads). Broad new/updated unit tests; ConsoleRenderer extracted for summary output.

Reviewed by Cursor Bugbot for commit d2c916c. Bugbot is set up for automated code reviews on this repo. Configure here.

shujaatTracebloc and others added 16 commits June 12, 2026 09:08
…rted records (#230)

Follow-up to #223, which fixed batch API-send failure accounting but was
only verified end-to-end for token_classification. Auditing the other 10
modalities found two real bugs of the same silent-success class plus a
diagnostics gap:

1. Five templates (image_classification, tabular_classification,
   tabular_regression, time_series_forecasting, time_to_event_prediction)
   logged and SWALLOWED exceptions in main()'s except-Exception handler.
   Any exception escaping ingest() — validation failure, DB error, or the
   fail-loud backend-registration RuntimeErrors from base.py — ended with
   exit code 0 and a K8s Job marked Succeeded. The "No data found for
   table name" cascade from the original incident would still have exited
   0 in these five modalities even after #223, because it surfaces as an
   exception, not as failed records. Each handler now re-raises (matching
   the other six templates).

2. _process_batch paired the API send with zip(ids, batch), which pairs
   positionally and truncates to len(ids). After a MID-batch DB failure
   (insert_batch's per-record fallback appends successes in scan order)
   the DB-failed record was sent to the API — a phantom backend record
   pointing at no MySQL row — while the last inserted record was never
   sent. Now the send list is the batch minus DB failures, matched by
   data_id, the same rule _flush_batch already uses for accounting.

3. send_global_meta_meta, send_generate_edge_label_meta, prepare_dataset
   and create_dataset still had the pre-#223 str(e)[:100] truncation with
   no response attached — the backend's field error explaining a 400 was
   cut right after "HTTP 400: ". Mirrored the send_batch fix: response
   attached to the raised HTTPError, handler logs HTTP <status>: <body>
   capped at 2,000 chars.

tests/test_modality_failure_accounting.py (39 tests, additive only):
api-send-failure + clean-run accounting parameterized across ALL 11
template categories (the #223 tests ran category=None, skipping the
file-transfer branch), an AST structural check that every template's
except-Exception handler re-raises, a mid-batch mis-send regression
guard, diagnostics guards for the four client methods, and a guard that
create_dataset's raise escapes ingest().

Full suite: 955 passed, 1 xfailed, coverage 97% (gate: 95%).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…#231)

Addresses the review on #230: the ~20-line run-and-report block (run
ingest(), log each failed record, sys.exit(1) on failures, re-raise hard
errors) was inlined in all 11 template scripts. The hand-maintained
copies had already drifted — five swallowed exceptions and exited 0
until #230, and the failed-record log line was wrong in three different
ways (wrapper-level .get("filename") logged None for MLM/text; the
tabular templates' .get("name") always logged Unknown).

The contract now lives once in
tracebloc_ingestor/utils/template_runner.py, exported from the package
root; each template's main() builds its modality-specific ingestor and
calls run_ingestion(ingestor, config.LABEL_FILE, ...). The helper also
fixes the failed-record identifier: filename where the category has one,
else data_id, else Unknown.

The two structural template tests are migrated to pin the new invariant
(every template imports + calls run_ingestion; any reintroduced
except-Exception/bare-except in main() must re-raise) — their old
assertions matched the inlined source pattern this change removes. The
behavioral guarantees they pinned move to direct unit tests in
tests/test_template_runner.py (SystemExit(1) on failed records,
re-raise on hard errors, SystemExit passes the handler untouched,
context-manager unwinding, identifier fallback chain).

Full suite: 962 passed, 1 xfailed; template_runner.py at 100% coverage,
total 97% (gate: 95%).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…-json

fix(validators): accept single-dict JSON top-level shape (issue #232)
…congruence guard (#240)

instance_segmentation was a zombie category in the declarative path: the
schema enum accepted it and conventions.py resolved defaults for it, but
map_validators had no branch (validation trivially passed with zero
checks), map_file_transfer had no branch, and the file-transfer gate in
BaseIngestor._ingest_with_lock excluded it - so a customer submitting
category: instance_segmentation got rows in MySQL, records at the backend
API, and a registered dataset with ZERO files staged to DEST_PATH.
Training then failed on missing files: the silent-half-ingest pattern
from #99.

Fail fast instead of half-implement:

- Remove instance_segmentation from the schema enum (3 sites), from
  TaskCategory, and from conventions.py. Submissions now fail at
  validation, before any DB/network I/O, with the supported list:
    category: 'instance_segmentation' is not one of ['image_classification', ...]
  Precedent for narrowing v1 to kill a silent failure: #213 (MLM label
  rejection). Re-adding when real support lands is purely additive.

- Unify base.py's _SRC_PATH_REQUIRED_CATEGORIES and the inline
  file-transfer gate list into one _FILE_BEARING_CATEGORIES set - their
  divergence (instance_segmentation sat in the first, not the second) is
  exactly how this shipped half-wired.

- Add tests/test_category_congruence.py: every category the schema enum
  accepts must resolve a non-empty validator set, file-bearing categories
  (derived from _data_format_for) must equal the file-transfer gate and
  have an explicit map_file_transfer branch, and every image category
  must have file_options defaults. The next category cannot ship
  half-wired without tripping these.

Full implementation of instance_segmentation stays blocked behind the
mask-sidecar wiring for the declarative path (#136 xfail in e2e).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
inspect(engine).get_columns() against a live MySQL returns DIALECT type
classes (INTEGER, VARCHAR, FLOAT, DATETIME, DECIMAL, ...), but
type_mapping was keyed by GENERIC SQLAlchemy class names (Integer,
String, ...). No reflected type ever matched, so every non-VARCHAR
column fell through to the "VARCHAR" default — the schema POSTed to the
backend via send_global_meta_meta reported INT/FLOAT/DATETIME columns
as VARCHAR. (VARCHAR columns only came out right because the fallback
happened to be VARCHAR.)

Match on the upper-cased class name instead: dialect class names
already ARE the MySQL keywords, so the map now only translates the
generic names that differ (String->VARCHAR, Integer->INT,
Numeric->DECIMAL, ...) and unknown types pass through verbatim instead
of being mislabelled VARCHAR. Also:

- DECIMAL/NUMERIC carry precision/scale: DECIMAL(10,2)
- BOOL columns (reflected by MySQL as TINYINT(1)) report as BOOLEAN
- bare String no longer renders as "VARCHAR(None)"

The old unit test fed generic types into a mocked inspector, so it
could not see the bug; it now pins the dialect-type behaviour, and a
new e2e test does declared type -> real CREATE TABLE -> real reflection
on MySQL 8 — the shape that would have caught this.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…, #237) (#242)

The DataValidator gate, the CSV cast, and the JSON per-record check each
decided independently what a declared type permits and what counts as
missing, so they drifted and disagreed — a file passed validation and then
crashed mid-ingest:

- #237: the validator read CSVs with pandas' default NA set (NA/null/None ->
  missing) but the non-tabular ingestor read with na_values=[""] only, so an
  "NA" token in a numeric column passed the gate then crashed the cast.
- #236: an out-of-int64 value coerced to a finite float at the gate (passed)
  but at ingest pandas read it as object/string dtype and the overflow guard
  threw a cryptic numpy "ufunc 'isinf'" / "Integer out of range".

New tracebloc_ingestor/utils/coercion.py centralises both decisions and is
consumed by all three layers, so the gate and the ingest can't disagree:

- build_csv_na_values(schema): per-column NA set applied to every schema
  column (numeric, date, string alike), used by BOTH the ingestor read and
  the validator's CSV reads with keep_default_na=False. Non-schema columns
  (filename, mask_id) are omitted so a file named "NA.jpg" survives.
- int_range_error / int_value_overflows: object/string-dtype-safe int64
  range check (coerce-then-check, never isinf/'>' on a raw object series).
  Out-of-range INT/BIGINT values are now rejected with a clear, actionable
  message ("declare the column as BIGINT") at BOTH preflight and ingest.

A 26-digit value in a FLOAT column is correctly accepted (valid float).
VARCHAR NA-handling is unchanged (schema string columns still treat NA/NULL
as missing — the team's documented convention).

Adds tests/test_coercion_consistency.py (cross-layer verdict matrix) and
rewrites test_csv_na_values.py for the unified policy.

Closes #236, #237.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…fast (#234, #235) (#243)

Records dropped by the skip paths never reached failed_records, so
run_ingestion exited 0 and the K8s Job was marked Succeeded even when rows
were silently lost (#234). And JSON's read layer silently skipped a bad
record while CSV aborts, so the two formats behaved inconsistently (#235).

#234 — close the silent-skip channels:
- A record dropped by process_record (blank/invalid unique_id, or a
  processing error) is now appended to failed_records, so the run exits
  non-zero instead of reporting clean success.
- IngestionSummary.has_failures and the summary's total_failures now include
  skipped_records — fixing the contradiction where a run printed
  "0 failure(s)" while the success-rate text said "most records failed".
- An invalid intent now fails fast (raises before any lock/DB work). It's a
  run-wide config error, not a per-row skip; previously it silently dropped
  every record and still exited 0.

#235 — make CSV and JSON consistent:
- Through the real ingest() path BOTH formats already fail-fast at the
  DataValidator gate on a bad cell; only JSON's read-layer diverged by
  silently skipping records the gate missed. _iter_validated_records now
  raises (fail-fast) like the CSV cast and the gate, so a bad record aborts
  with a clear non-zero exit instead of vanishing.

Per-row tolerance (keep good rows, drop+count bad ones — the issue's
"preferred direction") is intentionally deferred: it requires the
DataValidator gate to stop blocking on per-cell errors, a change to the
gate's role best done with the validator/registry refactor rather than
bundled into this correctness fix.

Closes #234.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…) (#244)

The per-row `filename` and `mask_id` come straight from the user's CSV/JSON
manifest and flowed unsanitised into os.path.join on both the read
(SRC_PATH) and write (DEST_PATH) sides of file_transfer. A crafted value
escaped both sandboxes:

- An ABSOLUTE value drops the sandbox prefix entirely
  (os.path.join("/data/shared","images","/etc/passwd") == "/etc/passwd").
- A "../../" value walks up out of the sandbox.

On the shared cluster PVC (multi-tenant / federated deployments) this is a
read primitive (copy another tenant's file or a pod-mounted secret INTO the
dataset, whose metadata then syncs to the platform) and an arbitrary-write
primitive (land the copy outside DEST_PATH, e.g. /etc/cron.d/evil). Affects
every file-bearing category.

Add _safe_join(root, *parts): normalise the join with os.path.abspath and
reject anything resolving outside `root`. Wire it at all seven
user-influenced join sites (_find_src; the image/annotation/text/mask DEST
writes; the text SRC read; _find_mask_src). A traversal/absolute value now
raises a clear ValueError (caught per-record by the ingest loop, so the
record is rejected and counted as a failure and the run exits non-zero); a
legitimately-missing sidecar still returns None (tolerated skip).

abspath (not realpath) is used so legitimately symlinked PVC mounts keep
working; the threat is manifest-driven ".."/absolute injection, not
attacker-controlled symlinks on the user's own staged data.

tests/test_file_transfer_path_traversal.py demonstrates each read/write
primitive is blocked and the legitimate basename path is unchanged.

Closes #239.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Grouped low-severity fixes found during ingestor stress-testing; each small
and independent.

1. Wrong delimiter -> misleading error. A ;/tab/pipe-delimited file (European
   Excel export, or a TSV saved as .csv) parses as a single column and the
   error was "Schema columns not present in CSV: <every column>". When the
   file parsed as one column whose header still holds the delimiter,
   CSVIngestor._validate_csv now hints at the real delimiter.

2. NUL byte silently truncates a field. pandas' C parser truncates "a\x00b"
   to "a" at read time — silent corruption. The NUL is valid UTF-8 so it
   slipped past the encoding preflight; _check_csv_encoding now also rejects
   NUL bytes up front with a clear message (single raw-byte scan, runs before
   read_data).

3. TableNameValidator message/impl mismatch. The compiled regex forbade a
   leading underscore while the error message AND valid_pattern metadata both
   said it was allowed, so "_tmp" was rejected with a message claiming it was
   valid. MySQL permits a leading underscore; the regex is relaxed to
   ^[a-zA-Z_][a-zA-Z0-9_]*$ to agree with the (correct) message/metadata.

4. Config numeric env coercion. A non-numeric MYSQL_PORT / BATCH_SIZE raised
   a raw "invalid literal for int()" at property access; Config now raises a
   clear "<FIELD> must be an integer" naming the env var to fix.

Closes #238.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…#246)

`pip install tracebloc_ingestor` and the cosign-signed Docker image both
pulled in pytest, black, flake8, mypy, isort and twine, because setup.py fed
the entire requirements.txt (which mixed runtime + dev tooling) into
install_requires — and even fed the section-comment lines verbatim.

- requirements.txt is now runtime-only and declares numpy explicitly (it's
  imported directly by csv_ingestor / data_validator; was relied on only
  transitively via pandas).
- requirements-dev.txt (new) carries the test / lint / type / release tooling
  and `-r requirements.txt`.
- setup.py filters blank / comment / -r lines when building install_requires.
- tests.yml / e2e.yml install requirements-dev.txt.

Verified: the built wheel's Requires-Dist is exactly the 13 runtime deps, no
dev tooling. The signed image (Dockerfile installs requirements.txt) and PyPI
consumers no longer ship test/lint tooling. Publish workflows are unaffected
(they install setuptools/wheel/twine directly, not via requirements.txt).
Full suite green (962 passed, 97% coverage).

First step of the data-ingestors refactor (packaging hygiene); the structural
phases follow once the in-flight fix PRs merge.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
… modality (#247)

Adds e2e/test_characterization.py — the safety net for the upcoming
structural refactor. For each bundled template dataset it runs the real
engine into real MySQL and pins three observable dimensions the refactor must
preserve:

1. MySQL rows — count == source manifest; standard-column semantics
   (data_intent, single non-null ingestor_id, unique non-null data_id);
   feature-value round-trip for the tabular family (catches type corruption).
2. DEST_PATH file manifest — the sidecar files copied for file-bearing
   categories (catches the insert-rows-but-copy-no-files silent-half-ingest
   class).
3. Backend payloads — the records + metadata handed to the APIClient,
   captured by spying on the APIClient method args (CLIENT_ENV=local
   short-circuits the HTTP call before the payload is serialised).

Covers all 9 cleanly-ingesting modalities. Expectations are DERIVED from the
source files (row counts, sidecar listings, schema) — no hardcoded magic
values — so the harness stays honest if a template changes.

It characterises CLEAN-input behavior, which the in-flight fix PRs (#242-#245)
don't change (they only touch malformed-input handling), so the goldens hold
on develop today and become the contract the refactor is checked against.

Building it already surfaced two behaviors worth pinning: MySQL FLOAT is
32-bit (so a ~1e-6 round-trip delta is correct, not corruption), and the
label column is mapped onto the standard `label` column (so it's absent from
the physical feature schema).

Runs in the e2e job (real MySQL); auto-skipped in the unit suite when no
MySQL is reachable (existing e2e/conftest mechanism).

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…my_type (#249)

`DataValidator._validate_char` accepts CHAR(N) — the validator vocabulary
includes CHAR, with explicit length enforcement and NULL tolerance just
like VARCHAR. But `_get_sqlalchemy_type` in the DDL layer had no CHAR
entry, so a user schema like `code: CHAR(1)` passed every preflight
validator and then died at table creation with:

    ValueError: Unsupported MySQL type: CHAR(1)

— the two layers had divergent type vocabularies, exactly the class of
mismatch #192 already fixed for NUMERIC.

Add CHAR to `type_mapping` (imported from sqlalchemy core), so:
  - `CHAR(N)` → CHAR(length=N) — fixed-width, padded per MySQL's CHAR
    semantics
  - bare `CHAR` → CHAR class, dialect picks the default length (1)

CHAR is functionally close to VARCHAR but semantically distinct (MySQL
right-pads CHAR to its declared length on storage), so using the
SQLAlchemy `CHAR` type rather than aliasing to `String` preserves the
declared intent.

Adds regression tests:
  - CHAR(1) → CHAR instance with length == 1
  - bare CHAR → CHAR (class or instance, dialect-defaulted)

Surfaced by an adversarial all-types tabular schema run against
v0.3.10-rc1: a CSV column declared `CHAR(1)` failed table creation
despite every type in the schema being documented as supported.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`read_data` caught `pd.errors.EmptyDataError` and silently returned an
empty generator (just emitted a YELLOW WARNING). The ingestor then:
  1. proceeded with an empty record stream
  2. created an empty MySQL table
  3. called `send_generate_edge_label_meta` against zero rows
  4. backend 400'd with `"No data found for table X and injestor_id Y"`
  5. PR #187's fail-loud handler surfaced that as
     `Backend rejected edge-label metadata; the dataset was NOT
     registered (its rows are already in the database).`

— a message that BLAMES the backend and even falsely tells the user
their rows are in the database, when the real cause was an empty
input file. Same misleading-cascade shape #213 documents for the
self-supervised + label mismatch.

Fix: raise ValueError at the read layer with a source-truthful message
naming the path and pointing at staging as the most-likely cause
(PVC mount, helm-vs-staging timing). DataValidator's existing
"No data found to validate" / preflight path catches it BEFORE any
backend round-trip, so the user sees the right reason at the right
layer.

Updates the existing `test_read_data_empty_file_returns_nothing` (which
asserted the old silent-empty behavior) to assert the new fail-fast
contract instead. Renamed to
`test_read_data_empty_file_raises_with_clear_message`.

Surfaced by an adversarial test pass against v0.3.10-rc1: an empty CSV
ended a chain that surfaced a confusing backend-blaming error several
layers removed from the actual fault.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onsoleRenderer (#248)

Structural refactor phase P2 (backend#796) — pull presentation out of the
ingestion logic. The summary box (success rate, per-channel counts, status
banner) was ~114 lines of print + ANSI + emoji living inside the 1,196-line
BaseIngestor god class.

- New tracebloc_ingestor/reporting.py with ConsoleRenderer.render_summary —
  the single home for that presentation. The body moved VERBATIM, so the
  customer-facing output is byte-for-byte identical.
- BaseIngestor._log_summary becomes a thin delegate (kept as a method so the
  existing callers/tests that reference BaseIngestor._log_summary keep
  working). base.py: -119/+10 lines.
- Drops the now-unused BLUE import from base.py.
- IngestionSummary stays the pure data object the renderer consumes; the
  renderer type-hints it via TYPE_CHECKING to avoid a runtime import cycle.

Payoff: the presentation is now unit-testable without a DB or a full ingest —
tests/test_reporting.py (8 cases, reporting.py 100% covered) locks the output
contract, including #234's "dropped records count toward the failure total /
disqualify the success banner" behavior.

Behaviour-preserving: full unit suite 1026 passed (97.3% coverage); the
characterization harness goldens (#247: DB cells / DEST_PATH manifest /
backend payloads) are unaffected — they don't assert the summary box — and the
CI e2e job re-confirms against real MySQL.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
… backend reason (issue #251) (#252)

* fix(ingest): catch single-label classification at preflight + surface backend reason (issue #251)

Two coupled improvements to the user experience when a classification
dataset has fewer than 2 distinct label values — surfaced by adversarial
testing against v0.3.10-rc1.

## The cascade we hit

Submit a `tabular_classification` (or any classification-family) dataset
where the `label` column has only ONE distinct value across all rows
(e.g. a 10-row test fixture all labelled `"X"`):

  1. Validators pass — no per-record issue, just label uniformity.
  2. All rows insert into MySQL successfully.
  3. `prepare_dataset` calls `/global_meta/prepare/`.
  4. Backend returns HTTP 400: `{"message":"Please provide atleast 2 labels."}`
  5. The framework raises a generic `"Backend failed to prepare the
     dataset; it was NOT registered (its rows are already in the
     database). See the logged API error above."`

The user-visible error says "see the logged API error above" — but the
ACTUAL reason ("Please provide atleast 2 labels") is one ERROR layer up
and easy to miss. Rows are orphaned in MySQL by the time anything
surfaces.

## Fix 1 — preflight validator (the right layer)

New `LabelDiversityValidator` in
`tracebloc_ingestor/validators/label_diversity_validator.py`:

  - Reads ONLY the label column when given a CSV path (usecols=[...] —
    a multi-GB proteomics panel doesn't get materialised to count
    labels).
  - Counts non-null distinct values; rejects when < min_distinct
    (default 2 — matching the backend's exact contract).
  - Error message NAMES the offending distinct value(s) and the
    value-count breakdown so the user immediately sees what's wrong
    with the input ("1 distinct value(s): ['X']. Value counts:
    {'X': 10}").
  - Suggests the regression-family categories as the right home for a
    single-target dataset, so users with continuous targets get a
    pointer instead of being told to fake a second class.
  - Gracefully no-ops on empty input or a missing label column — other
    validators already surface those clearly; don't double-report.

Wired in via `validators_mapping.py` for every classification-family
category: tabular_classification, image_classification,
text_classification, object_detection, semantic_segmentation, and
keypoint_detection. NOT wired for regression
(tabular_regression / time_series_forecasting /
time_to_event_prediction) — continuous targets legitimately have
"1 distinct value" semantics — nor for masked_language_modeling /
token_classification (self-supervised or per-token labels).

## Fix 2 — propagate backend body into the user-visible error

`APIClient.prepare_dataset`'s error handler now stashes the formatted
backend response (`HTTP <status>: <body>`) on
`self.last_prepare_error`. `base.py`'s "Backend failed to prepare"
RuntimeError reads that and includes the actual backend reason verbatim
in the user-visible message.

Before:

    Backend failed to prepare the dataset; it was NOT registered
    (its rows are already in the database). See the logged API error
    above.

After:

    Backend failed to prepare the dataset; it was NOT registered
    (its rows are already in the database). Backend response:
    HTTP 400: {"message":"Please provide atleast 2 labels."}

A user piping the log through filtered output (CI, error tooling, paste-
into-Slack) sees the cause in the visible line — no more grep
required.

## Tests

`tests/test_label_diversity_validator.py` (14 cases):
  - Positive: 2+ labels, many labels, nulls don't count, case-insensitive
    column match
  - Failure: single label (with value-counts in message), all-null,
    one-value-plus-nulls
  - Error mentions the regression alternative for single-target users
  - Defensive: empty input / missing column → silent pass (other
    validators own those errors)
  - CSV-path: reads only the label column (no full-table OOM risk),
    rejects single-label CSV at the file level
  - Custom label column name

Extended `tests/test_api_client_methods.py`:
  - On HTTP error, `last_prepare_error` populated with status + body
  - On a clean ingestor, `last_prepare_error is None`
  - On network error (no response), still populated with the
    stringified exception

(Backend-side typo "atleast" → "at least" lives in the backend repo —
flagged in #251 for whoever maintains that endpoint.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(validators): label-diversity loader fails closed + uses pandas header parse (bugbot #252)

Two bugbot findings on the label-diversity preflight (#251):

- High: _load_data swallowed read errors and returned None, which validate
  treats as empty → valid. A single-label CSV whose read failed could skip
  the gate and hit the backend rejection. Read errors now propagate to
  validate's handler and fail the check (fail-closed).

- Medium: header detection used a naive header_line.split(",") that
  diverges from pandas on quoted/alternate-delimiter headers. On a miss it
  fell back to nrows=1 and counted distinct labels on one row, rejecting a
  diverse dataset. Now resolves the column against pandas' own header parse
  (nrows=0); a genuinely-absent column returns None (benign skip).

Also update test_validators_mapping for the LabelDiversityValidator now in
the classification chains (was a stale pre-existing failure on this branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(label-diversity): address Cursor Bugbot follow-up on PR #252

- prepare_dataset clears `last_prepare_error` up front so an early
  `return False` (local mode / invalid category) can't leave a stale
  message that base.py then attaches to an unrelated failure.
- LabelDiversityValidator now accepts the schema so the label column is
  read with the same NA / dtype rules as CSVIngestor / DataValidator.
  Without it, the distinct-label count could disagree with what
  actually ingests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(label-diversity): cover NA-alignment with ingest read rules (bugbot #252)

Tests for the follow-up fixes in 1539301:
- schema label: NA sentinels ("null"/"NA") read as missing → single-label
  correctly flagged
- non-schema label: "NA"/"null" kept as genuine classes (matches ingestor)
- string-schema label: numeric-looking values ("01","1","1.0") not collapsed
  by numeric inference

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(label-diversity): use full schema for label NA rules, not stripped (bugbot #252)

base.py deletes the label/annotation/id columns from file_options["schema"]
(they're framework columns, not table columns), but CSVIngestor reads the
file with NA/dtype rules from the FULL self.schema — so the label column DOES
get NA-sentinel treatment at ingest. The label-diversity validator was wired
with the stripped schema, so _schema_type_for(label) always returned None and
the NA/dtype rules never applied to the label: "null"/"NA"/"" inflated the
distinct count at preflight while ingestion treated them as missing, letting
an effectively single-class dataset slip past the gate and fail at backend
prepare.

base.py now also passes the unstripped schema as `full_schema`; the validator
factory prefers it (falling back to `schema` for direct/test callers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(label-diversity): resolve label column whitespace-insensitively (bugbot #252)

CSVIngestor strips column-name whitespace on read (chunk.columns.str.strip()),
so a header like " label " is ingested as `label`. The validator resolved the
label column against the raw header, treated " label " as missing, skipped the
diversity check with a warning, and a single-class CSV passed preflight only
to fail at backend prepare. _resolve_column and _schema_type_for now strip
(and lowercase) on both sides, matching the ingestor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… raw JSON-schema mechanic (#254)

* fix(cli): surface schema descriptions in validation errors instead of raw mechanic

The ingest.yaml schema (``schema/ingest.v1.json``) carries beautifully
written ``description`` fields explaining each ``allOf`` rule's intent,
rationale, and (where applicable) a fix hint — e.g. for the
self-supervised + label rule:

    "Self-supervised categories MUST NOT set `label`. The shipped CSV
     has no label column, and the framework registers no edge-label
     metadata for them. Setting `label` anyway used to ingest rows
     successfully, then crash at backend registration with a misleading
     HTTP 400 'No data found' (issue #213). Reject at submission
     instead."

But ``_format_errors`` only emitted the raw ``ValidationError.message``,
which for a ``not`` / ``oneOf`` / ``allOf`` rule reads:

    <root>: {'apiVersion': 'tracebloc.io/v1', ..., 'label': 'label'}
            should not be valid under {'required': ['label']}

— technically correct, but it dumps the entire submission as a Python
dict and uses JSON-schema vocabulary the customer didn't write. A user
who copy-pasted ``label: label`` from another category's YAML can't
tell from the mechanic that MLM is self-supervised, or how to fix it.

Surfaced by adversarial testing: when the user-visible error fires for
this rule, the schema's helpful description never reaches the user.

## Fix

New helper ``_describe_from_schema_path(schema, absolute_schema_path)``
walks the schema from root toward the failing rule, capturing the
deepest ``description`` field encountered along the way (descriptions
are inherited by sub-trees per JSON-schema convention; the deepest one
is the most specific).

``_format_errors`` uses the helper when a description is available:

  <root>: Self-supervised categories MUST NOT set `label`. The shipped
          CSV has no label column, and the framework registers no
          edge-label metadata for them. Setting `label` anyway used to
          ingest rows successfully, then crash at backend registration
          with a misleading HTTP 400 'No data found' (issue #213).
          Reject at submission instead.
      (rule: not={'required': ['label']})

The mechanic still appears on a follow-up line for power-user debugging
without burying it in the headline. For primitive checks that already
have decent messages (``'label' is a required property``, enum
violations), we keep ``e.message`` unchanged — descriptions only attach
when one exists in the schema.

## Test

New ``test_schema_violation_surfaces_description_not_raw_mechanic`` in
``tests/test_cli_run.py``:
  - Submits MLM with ``label: label``
  - Asserts the user-visible error contains the schema's description
    keywords ("Self-supervised", "MUST NOT set `label`")
  - Asserts the rule mechanic still appears on a follow-up line
  - Asserts the old behavior (full YAML dict dumped in the headline) is
    NOT present
  - Asserts no DB / network I/O happens before validation rejects

Existing ``test_schema_violation_fails_fast`` (missing required fields)
continues to pass — primitive ``"X is a required property"`` messages
don't have descriptions in the schema and fall through to the original
``e.message`` path unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cli): description walker must skip the schema's root description (bugbot #254)

Cursor Bugbot caught a real bug in the previous commit: the walker
captured the root schema's ``description`` on its first iteration —
before descending — and the root description in ``ingest.v1.json`` is
a generic blurb about the schema as a whole ("Declarative configuration
for the tracebloc data ingestor..."). The result: EVERY validation
error returned a non-None description, so ``_format_errors`` replaced
jsonschema's clear primitive messages ("'label' is a required property")
with that generic prose plus a confusing ``(rule: required=...)`` line
— contradicting the very fallback path the previous commit was
supposed to preserve.

Fix: walk step-by-step, only capturing descriptions on nodes we
DESCEND INTO. The root is never captured. ``allOf`` branches,
sub-property definitions, and other inner nodes are — those are the
descriptions authored to explain specific rules.

## Verifying

Two tests pin the fix:

1. Updated ``test_schema_violation_fails_fast`` (the missing-required-
   field case) now asserts:
     - ``"is a required property"`` IS in the output (the primitive
       message survives)
     - ``"Declarative configuration"`` is NOT in the output (root
       description does not leak)

2. New unit test ``test_describe_from_schema_path_skips_root_description``
   on the walker itself:
     - Primitive ``["required"]`` path → returns None (no inner desc)
     - Generic property-type path → returns None
     - ``allOf`` branch path → returns the rule-specific description

The MLM+label end-to-end test (``test_schema_violation_surfaces_description_not_raw_mechanic``)
keeps passing — the rule-specific description is still surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukasWodka

Copy link
Copy Markdown
Collaborator

👋 Heads-up — Code review queue is at 15 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Release bundles the fixes merged to develop since 0.3.9 (#230#254):
- Ingestion accounting: dropped records fail the run; JSON read-layer fails
  fast (#230, #234, #235)
- Coercion: single source of truth for NA policy + int64 range (#236, #237)
- Security: block path traversal via manifest filename/mask_id (#239)
- UX papercuts: delimiter hint, NUL truncation, table-name message, Config
  numeric coercion (#238)
- Schema/DB: real MySQL column types, CHAR(N) mapping, drop
  instance_segmentation from enum, empty-CSV fast fail (#240, #241, #249, #250)
- DataValidator: accept single-dict / filter non-dict JSON (#232, #233)
- Single-label classification caught at preflight + friendly backend reason,
  with the full bugbot-hardened LabelDiversityValidator (#251, #252)
- CLI: schema descriptions surfaced in validation errors (#254)
- Reporting: ConsoleRenderer extraction; packaging split (#248, #246)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@divyasinghds divyasinghds changed the title Develop to prod Develop to prod 0.3.10 Jun 15, 2026
aptracebloc
aptracebloc previously approved these changes Jun 15, 2026
saadqbal
saadqbal previously approved these changes Jun 15, 2026
…s duplication (issue #261) (#262)

A new-user CSV with ``"  A  "`` (whitespace-padded) mixed with ``"A"``
landed in MySQL as TWO distinct classes — the ingestor reported
``🎉 Ingestion completed successfully!``, the LabelDiversityValidator
happily passed (2 distinct values ≥ 2), and a model trained on the
output learned 3 classes instead of the 2 the user intended. Silent
label-set corruption.

The framework already strips column HEADERS (``chunk.columns.str.strip()``
in csv_ingestor) and the ``data_id`` column (line above in
process_record). Extending the same convention to label-column VALUES is
internally consistent — a label is a class identifier, and class
identifiers don't carry whitespace semantics. (Other VARCHAR columns —
free-text descriptions, names, etc. — are left alone; only the label
column gets this treatment.)

## Two changes

### 1. Write-side strip — ``BaseIngestor.process_record``

Strip ``label_val.strip()`` for string labels before ``label_policy.apply``
runs. Numeric labels (INT class IDs) and other non-strings pass through
unchanged.

### 2. Preflight WARNING — ``LabelDiversityValidator``

Collapse whitespace-equivalent values BEFORE counting distinct, AND
surface a WARNING listing the pre-strip variants when collapsing
changed anything. The validator and the write side now use the same
distinct count, so a dataset with ``["  A  ", "A", "B"]``:
  - Validator: distinct=2 (A, B), warning lists ``{"A": ["  A  ", "A"]}``
  - Ingest: stores ``A`` and ``B`` — 2 classes in MySQL, as intended

A dataset where ALL variants collapse to a single class
(``["  A  ", "A", "  A"]``) still fails diversity:
  - Validator: distinct=1, rejects with a message that explicitly
    mentions whitespace stripping so the user knows what the validator
    did

## Tests

``test_label_diversity_validator.py`` (2 new):
  - ``test_whitespace_duplicates_collapsed_and_warned`` — mixed
    dataset passes; warning names the variants
  - ``test_whitespace_only_single_value_still_fails`` — all-collapse
    case rejects with a clear message

``test_ingestor_base.py`` (3 new):
  - ``test_process_record_strips_whitespace_from_string_label`` —
    ``"  A  "`` → ``"A"``
  - ``test_process_record_label_strip_makes_whitespace_variants_equivalent``
    — two records with whitespace variants produce identical cleaned
    labels
  - ``test_process_record_label_strip_preserves_non_string_labels`` —
    INT 42 passes through unchanged

## Surfaced by

Adversarial new-user test pass against v0.3.10-rc2 (scenario N15):
shipped CSV had ``"  A  "`` from an Excel paste, MySQL ended up with 3
distinct classes ``{"  A  ", "A", "B"}`` and the ingestor reported
success.

## Related

- #251: LabelDiversityValidator caught the inverse case (1 distinct).
  This fixes the silent-but-more-distinct sibling: validator accepts
  because the strings are bytewise distinct, but the user has fewer
  semantic classes than that.
- The whitespace strip on column HEADERS in csv_ingestor (already in
  develop) establishes the convention; this extends it to label values.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@divyasinghds divyasinghds dismissed stale reviews from saadqbal and aptracebloc via cab8514 June 15, 2026 11:09
divyasinghds and others added 3 commits June 15, 2026 16:40
Newline-delimited JSON (JSONL) is a common export format from log /
event pipelines (Datadog, Splunk, Kafka, ...) — one JSON object per
line, no top-level array. ``json.load`` rejects JSONL with the opaque
``json.JSONDecodeError: Extra data`` (Python 3.12+) or ``Trailing
data`` (earlier 3.x), which our preflight surfaced verbatim:

    Label-diversity validation error: Trailing data

A new user looking at that error has no idea (a) that their file is
JSONL, (b) that JSONL isn't a supported input format, or (c) how to
fix it.

## Fix

In ``DataValidator._load_data``'s .json branch, wrap ``json.load`` in
a try/except. On ``JSONDecodeError`` whose message says "Extra data"
or "Trailing data", run a JSONL probe: if the file's first non-empty
line itself parses as JSON, we have JSONL → raise a clear ValueError
naming the pattern and the fix:

    <path> looks like newline-delimited JSON (JSONL) — one JSON object
    per line. JSONL isn't supported as an input format; the ingestor
    expects a top-level JSON array of records or a single object.
    Combine the records into an array (wrap with ``[...]`` and join
    with ``,``) and re-ingest.

The detection is conservative — both the original error class AND the
first-line-parses check must hold. A genuinely malformed JSON file
(e.g. ``{trailing garbage``) doesn't trigger the JSONL message because
its first line doesn't parse standalone.

Also adjust ``_load_data``'s outer ``except Exception``: let
``ValueError`` propagate (so this fix message reaches the validator's
result), but keep swallowing other I/O exceptions as before.

## Tests

- ``test_loads_from_json_jsonl_surfaces_clear_error``: JSONL input →
  error mentions JSONL + the fix (combine into array)
- ``test_loads_from_json_malformed_not_misidentified_as_jsonl``:
  genuinely broken JSON falls through to the generic error path

## Surfaced by

Adversarial new-user test pass against v0.3.10-rc2, scenario N12.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A new user typing ``BIGINTEGER`` (mixing MySQL's BIGINT with Python's
``int``), ``BOLEAN`` (missing an O), ``NUMRIC`` (missing an E), etc.
used to get a bare ``Unsupported MySQL type: <typo>`` error with no
hint about what's actually supported. They'd have to spelunk the
schema docs or guess.

## Fix

Add a Levenshtein-based "Did you mean X?" hint to ``_get_sqlalchemy_type``'s
ValueError. The threshold (d ≤ 3) is the empirical sweet spot:
  - close enough to catch every realistic typo we've seen in adversarial
    testing (single-letter swap, missing letter, repeated letter,
    common prefix/suffix confusion)
  - wide enough to fail-silently on genuine unrelated types (a real
    ``GEOMETRY`` request stays "Unsupported" with no misleading hint)

The error ALSO prints the full supported-type list so:
  - the suggestion gets the user 80% of the way there
  - the list catches the cases where the closest-by-distance suggestion
    isn't the closest-by-intent (e.g. BIGINTEGER's nearest by edits is
    INTEGER (d=3), even though the user clearly wanted BIGINT (d=4) —
    the printed list has BIGINT right next to INTEGER)

## Tests

- ``test_get_sqlalchemy_type_typo_suggests_correction``: BIGINTEGER
  surfaces the INTEGER suggestion + the BIGINT alternative is in the
  supported list
- ``test_get_sqlalchemy_type_typo_no_suggestion_for_distant_input``:
  GEOMETRY (d=6 to nearest) gets no hint, only the supported list
- Parametrised cases over realistic typos: INTGER→INTEGER, NUMRIC→NUMERIC,
  BOLEAN→BOOLEAN, VARCAHR→VARCHAR

## Surfaced by

Adversarial new-user test pass against v0.3.10-rc2, scenario N3:
``BIGINTEGER`` got a bare unhelpful error.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…265)

A validator-rejected ingest used to leave an empty orphaned table behind:
`create_table` fired inside `__init__`, BEFORE `validate_data()` ran. On the
next retry the stale-table guard tripped and the user had to manually DROP
before re-running.

Move the `create_table` call out of `__init__` into `_ingest_with_lock`, right
after `validate_data()` accepts the input. The cleaned schema is stashed on
the ingestor at construction time and consumed on first ingest. `self.table`
starts as `None` and is only populated once the table actually exists.

Regression tests:
- Construction alone does not call `create_table` and leaves `self.table` as
  `None`.
- A validator-rejected ingest never calls `create_table`, so no orphan can
  remain for the next run to trip on.
- The happy path still creates the table exactly once, with the cleaned
  schema.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@divyasinghds divyasinghds merged commit c2c66ed into master Jun 15, 2026
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants