test(e2e): characterization harness pinning clean-ingest behavior per modality #247

Merged
divyasinghds merged 1 commit into develop from test/characterization-harness on Jun 15, 2026
Conversation

@LukasWodka

Collaborator

Summary

Phase 0 of the refactor — the characterization harness (safety net). Adds e2e/test_characterization.py: for each bundled templates/ dataset it runs the real engine into real MySQL and pins the observable behavior the upcoming structural refactor must preserve. Validated locally against MySQL (docker-compose) across all 9 modalities.

What it pins (three dimensions, per modality)

  1. MySQL rows — row count == source manifest; standard-column semantics (data_intent == configured intent, a single non-null ingestor_id, unique non-null data_id); and for the tabular family a feature-value round-trip (catches type corruption — leading zeros, NA handling, numeric coercion).
  2. DEST_PATH file manifest — exactly the sidecar files copied for file-bearing categories. This directly catches the insert-rows-but-copy-no-files silent-half-ingest class (the instance_segmentation zombie pattern).
  3. Backend payloads — the records + metadata the engine hands the APIClient. Since CLIENT_ENV=local short-circuits the HTTP call before serialising the payload, the harness spies on the APIClient method args (passed regardless of mode) — capturing the engine's intent to send.
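The spying approach in item 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness's actual code: the `APIClient` class body, its `send_batch` method, and `run_engine` are hypothetical stand-ins, since the real client's method names are not shown here. The point is only that the arguments reach the method even when local mode returns before any HTTP call.

```python
import os
from unittest import mock


class APIClient:
    """Hypothetical stand-in for the real client (method name is assumed)."""

    def send_batch(self, records, metadata):
        # Mirrors the described behavior: in local mode the call returns
        # before any payload is serialised or sent.
        if os.environ.get("CLIENT_ENV") == "local":
            return None
        raise RuntimeError("network call not available in this sketch")


def run_engine(client):
    """Hypothetical stand-in for an ingest run handing records to the client."""
    records = [{"data_id": "a1", "label": "cat"}, {"data_id": "a2", "label": "dog"}]
    client.send_batch(records, metadata={"data_intent": "train"})


def capture_payloads():
    """Run the engine and return the (records, metadata) pairs it tried to send."""
    original = APIClient.send_batch
    with mock.patch.dict(os.environ, {"CLIENT_ENV": "local"}), mock.patch.object(
        APIClient, "send_batch", autospec=True, side_effect=original
    ) as spy:
        run_engine(APIClient())

    captured = []
    for call in spy.call_args_list:
        # autospec on a method records `self` as the first positional argument.
        _self, records = call.args
        captured.append((records, call.kwargs["metadata"]))
    return captured
```

Using `side_effect=original` keeps the real short-circuit in place, so the spy observes the call without changing what the engine does.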

Covers all 9 cleanly-ingesting modalities (tabular clf/reg, image clf, text clf, time-to-event, time-series, keypoint, object detection, MLM). Expectations are derived from the source files (row counts, sidecar listings, schema) — no hardcoded magic values — so the harness stays honest if a template changes.
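A rough sketch of what "derived from the source files" could look like for the row-count and file-manifest dimensions. The file layout (a CSV manifest plus a directory of sidecar files) and all function names are assumptions made for illustration; the real templates may be organised differently.

```python
import csv
from pathlib import Path


def expected_row_count(manifest_csv: Path) -> int:
    """Count data rows in the source manifest (the header row is excluded)."""
    with manifest_csv.open(newline="") as fh:
        return sum(1 for _ in csv.DictReader(fh))


def list_files(directory: Path) -> set:
    """Names of the regular files directly inside `directory`."""
    return {p.name for p in directory.iterdir() if p.is_file()}


def manifest_diff(source_dir: Path, dest_dir: Path):
    """Return (missing, extra) file names, each sorted.

    `missing` are source files that never reached the destination, which is
    the insert-rows-but-copy-no-files case; `extra` are unexpected copies.
    """
    expected = list_files(source_dir)
    actual = list_files(dest_dir)
    return sorted(expected - actual), sorted(actual - expected)
```

A test would then assert that the table's row count equals `expected_row_count(...)` and that `manifest_diff(...)` returns two empty lists, with no literal numbers or file names written into the test itself.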

Why it's stable on develop today

It characterises clean-input behavior. The in-flight fix PRs (#242-#245) only change malformed-input handling (bad cells, NA tokens, traversal filenames, dropped-record accounting), so these goldens hold now and become the contract the refactor is checked against. When those PRs merge, any golden that legitimately shifts documents the intended change.

It already earned its keep

Building it surfaced two behaviors worth pinning explicitly:

  • MySQL FLOAT is 32-bit — a ~1e-6 feature round-trip delta is correct storage, not corruption (the round-trip tolerates float32 while still catching real corruption).
  • The label column is mapped onto the standard label column — so e.g. regression's price is correctly absent from the physical feature schema.
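The float32 point can be illustrated with a small comparison helper. This is a sketch only: the function names and the `1e-6` relative tolerance are assumptions chosen to match the delta mentioned above, not values taken from the harness.

```python
import math
import struct


def as_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float.

    This imitates what a 32-bit FLOAT column can hold for the value.
    """
    return struct.unpack("f", struct.pack("f", value))[0]


def round_trips(source: float, stored: float, rel_tol: float = 1e-6) -> bool:
    """True if `stored` matches `source` up to float32 storage precision."""
    return math.isclose(source, stored, rel_tol=rel_tol)
```

A value such as `0.1` does not survive float32 storage exactly, so a strict `==` would fail on correct data, while a relative tolerance still rejects a value that has genuinely changed.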

Operational

Runs in the e2e job (real MySQL service). Auto-skipped in the unit suite when no MySQL is reachable (existing e2e/conftest.py collect_ignore_glob mechanism), so the default pytest + coverage gate is unaffected. New file only, so there is no conflict with #242-#246.
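For readers unfamiliar with `collect_ignore_glob`, the general shape of such a mechanism is sketched below. This is not the contents of the existing e2e/conftest.py: the reachability probe, the `MYSQL_HOST` / `MYSQL_PORT` variable names, and the glob pattern are all assumptions. The only part taken from pytest itself is that a `conftest.py` may define `collect_ignore_glob` to exclude matching paths from collection.

```python
# Sketch of a conftest.py that hides e2e tests when MySQL is unreachable.
import os
import socket


def mysql_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical environment variable names, used here only for illustration.
_HOST = os.environ.get("MYSQL_HOST", "127.0.0.1")
_PORT = int(os.environ.get("MYSQL_PORT", "3306"))

# pytest reads this module-level name from conftest.py; paths matching the
# globs (relative to this file's directory) are not collected at all.
collect_ignore_glob = [] if mysql_reachable(_HOST, _PORT) else ["test_*.py"]
```

Because the files are excluded at collection time rather than marked as skipped, they do not show up in the unit run's report or coverage at all.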

🤖 Generated with Claude Code

… modality

Adds e2e/test_characterization.py — the safety net for the upcoming
structural refactor. For each bundled template dataset it runs the real
engine into real MySQL and pins three observable dimensions the refactor must
preserve:

1. MySQL rows — count == source manifest; standard-column semantics
   (data_intent, single non-null ingestor_id, unique non-null data_id);
   feature-value round-trip for the tabular family (catches type corruption).
2. DEST_PATH file manifest — the sidecar files copied for file-bearing
   categories (catches the insert-rows-but-copy-no-files silent-half-ingest
   class).
3. Backend payloads — the records + metadata handed to the APIClient,
   captured by spying on the APIClient method args (CLIENT_ENV=local
   short-circuits the HTTP call before the payload is serialised).

Covers all 9 cleanly-ingesting modalities. Expectations are DERIVED from the
source files (row counts, sidecar listings, schema) — no hardcoded magic
values — so the harness stays honest if a template changes.

It characterises CLEAN-input behavior, which the in-flight fix PRs (#242-#245)
don't change (they only touch malformed-input handling), so the goldens hold
on develop today and become the contract the refactor is checked against.

Building it already surfaced two behaviors worth pinning: MySQL FLOAT is
32-bit (so a ~1e-6 round-trip delta is correct, not corruption), and the
label column is mapped onto the standard `label` column (so it's absent from
the physical feature schema).

Runs in the e2e job (real MySQL); auto-skipped in the unit suite when no
MySQL is reachable (existing e2e/conftest mechanism).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@LukasWodka

Collaborator Author

👋 Heads-up — Code review queue is at 15 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@divyasinghds divyasinghds merged commit deca72e into develop Jun 15, 2026
6 checks passed
@divyasinghds divyasinghds deleted the test/characterization-harness branch June 15, 2026 06:39
divyasinghds pushed a commit that referenced this pull request Jun 15, 2026
…onsoleRenderer (#248)

Structural refactor phase P2 (backend#796) — pull presentation out of the
ingestion logic. The summary box (success rate, per-channel counts, status
banner) was ~114 lines of print + ANSI + emoji living inside the 1,196-line
BaseIngestor god class.

- New tracebloc_ingestor/reporting.py with ConsoleRenderer.render_summary —
  the single home for that presentation. The body moved VERBATIM, so the
  customer-facing output is byte-for-byte identical.
- BaseIngestor._log_summary becomes a thin delegate (kept as a method so the
  existing callers/tests that reference BaseIngestor._log_summary keep
  working). base.py: -119/+10 lines.
- Drops the now-unused BLUE import from base.py.
- IngestionSummary stays the pure data object the renderer consumes; the
  renderer type-hints it via TYPE_CHECKING to avoid a runtime import cycle.
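
The delegate-plus-TYPE_CHECKING shape described in these bullets might look roughly like the sketch below. It is an illustration only: the import path, the `succeeded` / `failed` fields, the stream parameter, and the output line are hypothetical, and the real render_summary body was moved verbatim rather than rewritten like this.

```python
import sys
from typing import TYPE_CHECKING, Optional, TextIO

if TYPE_CHECKING:
    # Seen only by type checkers, never executed at runtime, so the renderer
    # module does not import the ingestor module back (no import cycle).
    # The module path below is a hypothetical placeholder.
    from ingestion_summary import IngestionSummary


class ConsoleRenderer:
    """Owns presentation; knows nothing about databases or ingestion."""

    def __init__(self, stream: Optional[TextIO] = None):
        self._stream = sys.stdout if stream is None else stream

    def render_summary(self, summary: "IngestionSummary") -> None:
        # Hypothetical fields; the real summary carries more detail.
        total = summary.succeeded + summary.failed
        rate = 100.0 * summary.succeeded / total if total else 0.0
        self._stream.write(
            f"success rate: {rate:.1f}% ({summary.succeeded}/{total})\n"
        )


class BaseIngestor:
    def __init__(self, renderer: Optional[ConsoleRenderer] = None):
        self._renderer = ConsoleRenderer() if renderer is None else renderer

    def _log_summary(self, summary: "IngestionSummary") -> None:
        # Thin delegate: existing callers keep using this method name.
        self._renderer.render_summary(summary)
```

Because the annotation is a string and the import sits under `TYPE_CHECKING`, any object with the right attributes can be passed in a unit test, with an in-memory stream standing in for the console.
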

Payoff: the presentation is now unit-testable without a DB or a full ingest —
tests/test_reporting.py (8 cases, reporting.py 100% covered) locks the output
contract, including #234's "dropped records count toward the failure total /
disqualify the success banner" behavior.

Behaviour-preserving: full unit suite 1026 passed (97.3% coverage); the
characterization harness goldens (#247: DB cells / DEST_PATH manifest /
backend payloads) are unaffected — they don't assert the summary box — and the
CI e2e job re-confirms against real MySQL.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>