refactor(P5a): extract run-level preflight checks into ingestors/preflight#274
Merged
Merged
Conversation
Collaborator
Author
|
👋 Heads-up — Code review queue is at 14 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
Base automatically changed from
refactor/p4c-file-transfer-paths-bridge
to
develop
June 16, 2026 09:34
…light Structural refactor phase P5 (backend#796), first slice of the BaseIngestor god-class decomposition (base.py 1180 -> 1070 lines). The three fail-fast guards that run once per ingest before any lock / DB / row work — _check_src_path (#772 P2), _check_csv_encoding (#238 UTF-8 / NUL), and _check_intent (#234) — were methods on BaseIngestor but are stateless validations of a single input each. Moved verbatim (error messages unchanged) into a focused `ingestors/preflight.py` module of functions; BaseIngestor calls them directly (preflight.check_src_path / check_csv_encoding / check_intent), with no delegate stubs left behind. Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Tests that exercised the checks now call the preflight functions (the bypass patch in test_coverage_gaps2 retargets base_mod.preflight). Removed the now-unused pathlib.Path import from base.py. First of several P5 slices; next candidates: TableLock (lock lifecycle), RecordProcessor (process_record / _map_unique_id), and the batch / DB / API write path — each extracted under the same e2e gate, with the deferred bug fixes (mask_id leak, per-row tolerance #235, atomicity #227 / #260) riding the slice that owns the relevant code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
4e46181 to
41aca4e
Compare
divyasinghds
approved these changes
Jun 16, 2026
LukasWodka
added a commit
that referenced
this pull request
Jun 16, 2026
Structural refactor phase P5 (backend#796), god-class decomposition slice 2. The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was three methods + a constant on BaseIngestor. It's a cohesive, stateful responsibility, so it moves verbatim (every error message unchanged) into a `TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a `_table_lock` property. _ingest_with_lock now does `lock = self._table_lock; lock.acquire()` ... `lock.release(path)`. STORAGE_PATH is a Config class constant (not env/override driven), so TableLock reads it from a plain Config() exactly as before — the ~10 lock tests that patch Config.STORAGE_PATH keep working; they now exercise the lock via `ing._table_lock.acquire()` etc. Removed the now-unused `os` and `..config.Config` imports from base.py (their last users moved out in P5a/P5b). Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB / API write path), each behaviour-preserving; the deferred bug fixes (mask_id leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
divyasinghds
pushed a commit
that referenced
this pull request
Jun 17, 2026
Structural refactor phase P5 (backend#796), god-class decomposition slice 2. The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was three methods + a constant on BaseIngestor. It's a cohesive, stateful responsibility, so it moves verbatim (every error message unchanged) into a `TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a `_table_lock` property. _ingest_with_lock now does `lock = self._table_lock; lock.acquire()` ... `lock.release(path)`. STORAGE_PATH is a Config class constant (not env/override driven), so TableLock reads it from a plain Config() exactly as before — the ~10 lock tests that patch Config.STORAGE_PATH keep working; they now exercise the lock via `ing._table_lock.acquire()` etc. Removed the now-unused `os` and `..config.Config` imports from base.py (their last users moved out in P5a/P5b). Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB / API write path), each behaviour-preserving; the deferred bug fixes (mask_id leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
divyasinghds
added a commit
that referenced
this pull request
Jun 17, 2026
…276) * refactor(P5b): extract the cross-ingest table lock into TableLock Structural refactor phase P5 (backend#796), god-class decomposition slice 2. The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was three methods + a constant on BaseIngestor. It's a cohesive, stateful responsibility, so it moves verbatim (every error message unchanged) into a `TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a `_table_lock` property. _ingest_with_lock now does `lock = self._table_lock; lock.acquire()` ... `lock.release(path)`. STORAGE_PATH is a Config class constant (not env/override driven), so TableLock reads it from a plain Config() exactly as before — the ~10 lock tests that patch Config.STORAGE_PATH keep working; they now exercise the lock via `ing._table_lock.acquire()` etc. Removed the now-unused `os` and `..config.Config` imports from base.py (their last users moved out in P5a/P5b). Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB / API write path), each behaviour-preserving; the deferred bug fixes (mask_id leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(P5c): extract the per-record transform into RecordProcessor Structural refactor phase P5 (backend#796), god-class decomposition slice 3. process_record + _map_unique_id — turning a raw source row into the cleaned, DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the label after the configured label policy, data_intent, annotation, framework columns) — move verbatim into a RecordProcessor class (ingestors/record_processor.py). The attribute names match the ingestor's so the bodies are byte-for-byte unchanged. BaseIngestor composes it via a `_record_processor` property (built from the run's column / label / intent config). Its public `process_record` stays as a one-line delegate, since the ingest loop and ~26 test call-sites use `self.process_record` / `ing.process_record`; _map_unique_id (no direct test callers) moves fully inside RecordProcessor. Removed the now-unused pandas / Intent / TaskCategory imports from base.py. Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Two tests that reached into the old BaseIngestor._map_unique_id now target RecordProcessor (the label-policy hook lives there). The deferred mask_id cross-layer leak is preserved exactly (process still writes mask_id for semantic_segmentation) — untangling it is a follow-up, not this slice. base.py: 1180 -> 774 lines across P5a+P5b+P5c. Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Divya <divyasingh@tracebloc.io>
divyasinghds
added a commit
that referenced
this pull request
Jun 17, 2026
* refactor(P5b): extract the cross-ingest table lock into TableLock Structural refactor phase P5 (backend#796), god-class decomposition slice 2. The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was three methods + a constant on BaseIngestor. It's a cohesive, stateful responsibility, so it moves verbatim (every error message unchanged) into a `TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a `_table_lock` property. _ingest_with_lock now does `lock = self._table_lock; lock.acquire()` ... `lock.release(path)`. STORAGE_PATH is a Config class constant (not env/override driven), so TableLock reads it from a plain Config() exactly as before — the ~10 lock tests that patch Config.STORAGE_PATH keep working; they now exercise the lock via `ing._table_lock.acquire()` etc. Removed the now-unused `os` and `..config.Config` imports from base.py (their last users moved out in P5a/P5b). Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB / API write path), each behaviour-preserving; the deferred bug fixes (mask_id leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(P5c): extract the per-record transform into RecordProcessor Structural refactor phase P5 (backend#796), god-class decomposition slice 3. process_record + _map_unique_id — turning a raw source row into the cleaned, DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the label after the configured label policy, data_intent, annotation, framework columns) — move verbatim into a RecordProcessor class (ingestors/record_processor.py). The attribute names match the ingestor's so the bodies are byte-for-byte unchanged. BaseIngestor composes it via a `_record_processor` property (built from the run's column / label / intent config). Its public `process_record` stays as a one-line delegate, since the ingest loop and ~26 test call-sites use `self.process_record` / `ing.process_record`; _map_unique_id (no direct test callers) moves fully inside RecordProcessor. Removed the now-unused pandas / Intent / TaskCategory imports from base.py. Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Two tests that reached into the old BaseIngestor._map_unique_id now target RecordProcessor (the label-policy hook lives there). The deferred mask_id cross-layer leak is preserved exactly (process still writes mask_id for semantic_segmentation) — untangling it is a follow-up, not this slice. base.py: 1180 -> 774 lines across P5a+P5b+P5c. Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(P5d): extract the batch write path into BatchWriter Structural refactor phase P5 (backend#796), god-class decomposition slice 4 — the final extraction. _flush_batch + _process_batch — insert a batch to MySQL, publish the inserted rows to the backend, and fold the outcome (inserted / api-sent / failed) into the run's stats + failed_records — move verbatim into a BatchWriter class (ingestors/batch_writer.py). The subtle mid-batch-DB-failure accounting (match inserted-but-unsent records by data_id, not position) and the #99 / api_send_failed surfacing are byte-for-byte unchanged. Attribute names match the ingestor's so the bodies are identical. BaseIngestor composes it via a `_batch_writer` property; _ingest_with_lock's two flush sites call self._batch_writer.flush(...). The unused `session` parameter is threaded through unchanged for parity (a later cleanup can drop it). Tests that drove _process_batch directly now call ing._batch_writer._process(...). Behaviour-preserving: full unit suite 1080 passed, 96.9% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Completes the P5 decomposition: BaseIngestor (now 643 lines, was 1180) is a thin orchestrator composing Preflight (P5a) + TableLock (P5b) + RecordProcessor (P5c) + BatchWriter (P5d), keeping only ingest()/_ingest_with_lock orchestration + validate_data + create_table/dataset registration. The deferred correctness fixes (mask_id leak, per-row tolerance #235, atomicity #227/#260) now have clean seams to land on as separate reviewed PRs. Stacked on #276 (P5c). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Divya <divyasingh@tracebloc.io>
divyasinghds
added a commit
that referenced
this pull request
Jun 17, 2026
) * refactor(P5b): extract the cross-ingest table lock into TableLock Structural refactor phase P5 (backend#796), god-class decomposition slice 2. The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was three methods + a constant on BaseIngestor. It's a cohesive, stateful responsibility, so it moves verbatim (every error message unchanged) into a `TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a `_table_lock` property. _ingest_with_lock now does `lock = self._table_lock; lock.acquire()` ... `lock.release(path)`. STORAGE_PATH is a Config class constant (not env/override driven), so TableLock reads it from a plain Config() exactly as before — the ~10 lock tests that patch Config.STORAGE_PATH keep working; they now exercise the lock via `ing._table_lock.acquire()` etc. Removed the now-unused `os` and `..config.Config` imports from base.py (their last users moved out in P5a/P5b). Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB / API write path), each behaviour-preserving; the deferred bug fixes (mask_id leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(P5c): extract the per-record transform into RecordProcessor Structural refactor phase P5 (backend#796), god-class decomposition slice 3. process_record + _map_unique_id — turning a raw source row into the cleaned, DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the label after the configured label policy, data_intent, annotation, framework columns) — move verbatim into a RecordProcessor class (ingestors/record_processor.py). The attribute names match the ingestor's so the bodies are byte-for-byte unchanged. BaseIngestor composes it via a `_record_processor` property (built from the run's column / label / intent config). Its public `process_record` stays as a one-line delegate, since the ingest loop and ~26 test call-sites use `self.process_record` / `ing.process_record`; _map_unique_id (no direct test callers) moves fully inside RecordProcessor. Removed the now-unused pandas / Intent / TaskCategory imports from base.py. Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Two tests that reached into the old BaseIngestor._map_unique_id now target RecordProcessor (the label-policy hook lives there). The deferred mask_id cross-layer leak is preserved exactly (process still writes mask_id for semantic_segmentation) — untangling it is a follow-up, not this slice. base.py: 1180 -> 774 lines across P5a+P5b+P5c. Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(P5d): extract the batch write path into BatchWriter Structural refactor phase P5 (backend#796), god-class decomposition slice 4 — the final extraction. _flush_batch + _process_batch — insert a batch to MySQL, publish the inserted rows to the backend, and fold the outcome (inserted / api-sent / failed) into the run's stats + failed_records — move verbatim into a BatchWriter class (ingestors/batch_writer.py). The subtle mid-batch-DB-failure accounting (match inserted-but-unsent records by data_id, not position) and the #99 / api_send_failed surfacing are byte-for-byte unchanged. Attribute names match the ingestor's so the bodies are identical. BaseIngestor composes it via a `_batch_writer` property; _ingest_with_lock's two flush sites call self._batch_writer.flush(...). The unused `session` parameter is threaded through unchanged for parity (a later cleanup can drop it). Tests that drove _process_batch directly now call ing._batch_writer._process(...). Behaviour-preserving: full unit suite 1080 passed, 96.9% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Completes the P5 decomposition: BaseIngestor (now 643 lines, was 1180) is a thin orchestrator composing Preflight (P5a) + TableLock (P5b) + RecordProcessor (P5c) + BatchWriter (P5d), keeping only ingest()/_ingest_with_lock orchestration + validate_data + create_table/dataset registration. The deferred correctness fixes (mask_id leak, per-row tolerance #235, atomicity #227/#260) now have clean seams to land on as separate reviewed PRs. Stacked on #276 (P5c). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(P5): move mask_id off the DB-bound record (sidecar split) Completes the P5 decomposition's last cross-layer loose end. semantic_ segmentation's mask_id is a per-row pointer to a mask FILE, not a table column (there's no mask_id column on the standard table — #212). It used to ride the cleaned, DB-bound record across three layers: RecordProcessor SET it, map_file_transfer READ it, and BatchWriter POPPED it before insert — a fragile indirection where forgetting the pop broke inserts on non-semseg tables. Now the sidecar pointer stays on the RAW source record and map_file_transfer owns its lifecycle: it LENDS mask_id from the raw record to the transfer for the duration of the copy, then STRIPS it before return (in finally — even on a failed / None transfer). So: - RecordProcessor no longer writes mask_id onto the cleaned record (and no longer needs `category` at all — dropped). - BatchWriter no longer pops it (the cleaned record never carries it). - BaseIngestor.ingest passes the raw record as map_file_transfer(..., source_record=record). The cleaned record now holds ONLY table + framework columns — a runtime-only pointer can't reach the DB insert by construction (the #212 class of bug is now unrepresentable, not just guarded). Behaviour-preserving: full unit suite 1080 passed, 96.9% coverage; e2e (real MySQL) 23 passed, 1 xpassed — the semantic_segmentation template still ingests (its mask transfers via the lent pointer; #247 goldens unchanged). Tests that asserted mask_id on the cleaned record / BatchWriter's pop are updated; a new test pins map_file_transfer's lend-then-strip. Stacked on #278 (P5d). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(P5 mask_id): exclude sidecar keys from the cleaned record even when in schema Audit follow-up — self-review caught a behaviour change the first cut smuggled in. It assumed the cleaned record never carries mask_id once RecordProcessor stops adding it, but the shipped semantic_segmentation TEMPLATE lists mask_id in its schema (schema={"mask_id": "VARCHAR(255)"}). So RecordProcessor's schema comprehension KEPT it, map_file_transfer's `key not in record` guard then skipped the strip, and mask_id reached insert_batch — flipping that DB column from NULL (develop: unconditional pop) to populated. The e2e missed it because the harness's semseg case is schema-less. Fix (restores develop's behaviour AND genuinely removes the leak): - SIDECAR_KEYS moves to utils/constants (shared). RecordProcessor excludes it from the cleaned record EVEN when listed in schema, so mask_id is never DB-bound regardless of how a template declares it. - map_file_transfer strips the FULL SIDECAR_KEYS set in `finally` (not only what it lent) — defense-in-depth so a sidecar pointer can never bind as a column. - BatchWriter / RecordProcessor docstrings corrected (the old comment asserted a now-false invariant). Tests: + test_process_record_excludes_mask_id_even_when_in_schema (the leak path) and + test_map_file_transfer_strips_preexisting_sidecar_key (the strip-all defense). Full suite 1082 passed, 96.9%; e2e 23 passed, 1 xpassed (schema-less semseg still transfers via the lend). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Revert "fix(P5 mask_id): exclude sidecar keys from the cleaned record even when in schema" This reverts commit cc50bc3. * fix(P5 mask_id): store mask_id when the schema declares it — it's the client contract Audit conclusion (backend#816): the prior "B1 fix" (reverted in the parent commit) was itself the regression, and so was develop's blanket pop. The training client SELECTs the dataset row from cluster MySQL and does str(row["mask_id"]) to locate each mask file, raising FileNotFoundError if mask_id is missing/NULL (tracebloc-client segmentation_dataset_pytorch.py:98). So semseg REQUIRES mask_id stored — develop NULLed it; excluding it (B1) kept it NULL; both break semseg training. Correct behaviour (restored by the revert): mask_id is KEPT and stored when the template DECLARES it in schema (a real column the client reads); it is lent from the raw record + stripped for the transfer ONLY when not declared (no column). This commit: - Corrects the now-stale docstrings (record_processor / file_transfer map_file_transfer / transfer / batch_writer) that wrongly claimed mask_id is "never on the cleaned record" — the framing that caused the misdiagnosis. - Adds test_process_record_stores_mask_id_when_declared_in_schema (the contract guard that was missing) and reframes the schema-less complement test. Full suite 1081 passed, 96.9%; e2e 23 passed, 1 xpassed. The end-to-end ingestor->client contract is tracked for DS sign-off in backend#816. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Divya <divyasingh@tracebloc.io>
divyasinghds
added a commit
that referenced
this pull request
Jun 17, 2026
…cterization, _refresh_token (#282) * refactor(P5b): extract the cross-ingest table lock into TableLock Structural refactor phase P5 (backend#796), god-class decomposition slice 2. The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was three methods + a constant on BaseIngestor. It's a cohesive, stateful responsibility, so it moves verbatim (every error message unchanged) into a `TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a `_table_lock` property. _ingest_with_lock now does `lock = self._table_lock; lock.acquire()` ... `lock.release(path)`. STORAGE_PATH is a Config class constant (not env/override driven), so TableLock reads it from a plain Config() exactly as before — the ~10 lock tests that patch Config.STORAGE_PATH keep working; they now exercise the lock via `ing._table_lock.acquire()` etc. Removed the now-unused `os` and `..config.Config` imports from base.py (their last users moved out in P5a/P5b). Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB / API write path), each behaviour-preserving; the deferred bug fixes (mask_id leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(P5c): extract the per-record transform into RecordProcessor Structural refactor phase P5 (backend#796), god-class decomposition slice 3. process_record + _map_unique_id — turning a raw source row into the cleaned, DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the label after the configured label policy, data_intent, annotation, framework columns) — move verbatim into a RecordProcessor class (ingestors/record_processor.py). The attribute names match the ingestor's so the bodies are byte-for-byte unchanged. BaseIngestor composes it via a `_record_processor` property (built from the run's column / label / intent config). Its public `process_record` stays as a one-line delegate, since the ingest loop and ~26 test call-sites use `self.process_record` / `ing.process_record`; _map_unique_id (no direct test callers) moves fully inside RecordProcessor. Removed the now-unused pandas / Intent / TaskCategory imports from base.py. Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Two tests that reached into the old BaseIngestor._map_unique_id now target RecordProcessor (the label-policy hook lives there). The deferred mask_id cross-layer leak is preserved exactly (process still writes mask_id for semantic_segmentation) — untangling it is a follow-up, not this slice. base.py: 1180 -> 774 lines across P5a+P5b+P5c. Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(P5d): extract the batch write path into BatchWriter Structural refactor phase P5 (backend#796), god-class decomposition slice 4 — the final extraction. _flush_batch + _process_batch — insert a batch to MySQL, publish the inserted rows to the backend, and fold the outcome (inserted / api-sent / failed) into the run's stats + failed_records — move verbatim into a BatchWriter class (ingestors/batch_writer.py). The subtle mid-batch-DB-failure accounting (match inserted-but-unsent records by data_id, not position) and the #99 / api_send_failed surfacing are byte-for-byte unchanged. Attribute names match the ingestor's so the bodies are identical. BaseIngestor composes it via a `_batch_writer` property; _ingest_with_lock's two flush sites call self._batch_writer.flush(...). The unused `session` parameter is threaded through unchanged for parity (a later cleanup can drop it). Tests that drove _process_batch directly now call ing._batch_writer._process(...). Behaviour-preserving: full unit suite 1080 passed, 96.9% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Completes the P5 decomposition: BaseIngestor (now 643 lines, was 1180) is a thin orchestrator composing Preflight (P5a) + TableLock (P5b) + RecordProcessor (P5c) + BatchWriter (P5d), keeping only ingest()/_ingest_with_lock orchestration + validate_data + create_table/dataset registration. The deferred correctness fixes (mask_id leak, per-row tolerance #235, atomicity #227/#260) now have clean seams to land on as separate reviewed PRs. Stacked on #276 (P5c). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(P5): move mask_id off the DB-bound record (sidecar split) Completes the P5 decomposition's last cross-layer loose end. semantic_ segmentation's mask_id is a per-row pointer to a mask FILE, not a table column (there's no mask_id column on the standard table — #212). It used to ride the cleaned, DB-bound record across three layers: RecordProcessor SET it, map_file_transfer READ it, and BatchWriter POPPED it before insert — a fragile indirection where forgetting the pop broke inserts on non-semseg tables. Now the sidecar pointer stays on the RAW source record and map_file_transfer owns its lifecycle: it LENDS mask_id from the raw record to the transfer for the duration of the copy, then STRIPS it before return (in finally — even on a failed / None transfer). So: - RecordProcessor no longer writes mask_id onto the cleaned record (and no longer needs `category` at all — dropped). - BatchWriter no longer pops it (the cleaned record never carries it). - BaseIngestor.ingest passes the raw record as map_file_transfer(..., source_record=record). The cleaned record now holds ONLY table + framework columns — a runtime-only pointer can't reach the DB insert by construction (the #212 class of bug is now unrepresentable, not just guarded). Behaviour-preserving: full unit suite 1080 passed, 96.9% coverage; e2e (real MySQL) 23 passed, 1 xpassed — the semantic_segmentation template still ingests (its mask transfers via the lent pointer; #247 goldens unchanged). Tests that asserted mask_id on the cleaned record / BatchWriter's pop are updated; a new test pins map_file_transfer's lend-then-strip. Stacked on #278 (P5d). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(P5 mask_id): exclude sidecar keys from the cleaned record even when in schema Audit follow-up — self-review caught a behaviour change the first cut smuggled in. It assumed the cleaned record never carries mask_id once RecordProcessor stops adding it, but the shipped semantic_segmentation TEMPLATE lists mask_id in its schema (schema={"mask_id": "VARCHAR(255)"}). So RecordProcessor's schema comprehension KEPT it, map_file_transfer's `key not in record` guard then skipped the strip, and mask_id reached insert_batch — flipping that DB column from NULL (develop: unconditional pop) to populated. The e2e missed it because the harness's semseg case is schema-less. Fix (restores develop's behaviour AND genuinely removes the leak): - SIDECAR_KEYS moves to utils/constants (shared). RecordProcessor excludes it from the cleaned record EVEN when listed in schema, so mask_id is never DB-bound regardless of how a template declares it. - map_file_transfer strips the FULL SIDECAR_KEYS set in `finally` (not only what it lent) — defense-in-depth so a sidecar pointer can never bind as a column. - BatchWriter / RecordProcessor docstrings corrected (the old comment asserted a now-false invariant). Tests: + test_process_record_excludes_mask_id_even_when_in_schema (the leak path) and + test_map_file_transfer_strips_preexisting_sidecar_key (the strip-all defense). Full suite 1082 passed, 96.9%; e2e 23 passed, 1 xpassed (schema-less semseg still transfers via the lend). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Revert "fix(P5 mask_id): exclude sidecar keys from the cleaned record even when in schema" This reverts commit cc50bc3. * fix(P5 mask_id): store mask_id when the schema declares it — it's the client contract Audit conclusion (backend#816): the prior "B1 fix" (reverted in the parent commit) was itself the regression, and so was develop's blanket pop. The training client SELECTs the dataset row from cluster MySQL and does str(row["mask_id"]) to locate each mask file, raising FileNotFoundError if mask_id is missing/NULL (tracebloc-client segmentation_dataset_pytorch.py:98). So semseg REQUIRES mask_id stored — develop NULLed it; excluding it (B1) kept it NULL; both break semseg training. Correct behaviour (restored by the revert): mask_id is KEPT and stored when the template DECLARES it in schema (a real column the client reads); it is lent from the raw record + stripped for the transfer ONLY when not declared (no column). This commit: - Corrects the now-stale docstrings (record_processor / file_transfer map_file_transfer / transfer / batch_writer) that wrongly claimed mask_id is "never on the cleaned record" — the framing that caused the misdiagnosis. - Adds test_process_record_stores_mask_id_when_declared_in_schema (the contract guard that was missing) and reframes the schema-less complement test. Full suite 1081 passed, 96.9%; e2e 23 passed, 1 xpassed. The end-to-end ingestor->client contract is tracked for DS sign-off in backend#816. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(coverage): close audit gaps — token_classification e2e, semseg characterization, _refresh_token, README Closes the test-coverage gaps the audit surfaced: - token_classification: was in ZERO e2e tests despite a bundled template — now ingests cleanly in test_ingest_e2e (its CSV quotes the extension as '.txt', exercising that path too). - semantic_segmentation: added to the characterization harness — pins the full contract (masks land in DEST_PATH + mask_id stored for the client). It PASSES: the P5 mask_id work wires masks through the declarative path, so #136 is fixed and its stale test_ingest_e2e xfail(strict=False) is removed (the silent-fix gap the audit flagged). The cross-repo client contract is tracked in backend#816. - _refresh_token: the #772 mid-run BACKEND_TOKEN rotation body was only ever exercised with the method STUBBED; added direct tests for the rotation path (env rotates -> True / unchanged -> False) and the creds re-auth path (authenticate -> True / raises -> False). - e2e/README.md: corrected the stale "Known gaps" table (OD #135, MLM #137, semseg #136 all landed) and switched guidance to xfail(strict=True). e2e: 26 passed (was 23 + 1 xpass). Unit: 1085 passed, 97.1% (+4 token tests; api/client _refresh_token now covered). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Divya <divyasingh@tracebloc.io>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Structural refactor — phase P5 (decompose the
BaseIngestorgod-class), slice 1. Stacked on #273 (P4c).BaseIngestoris ~1180 lines doing everything (preflight, locking, record processing, batch/DB/API writes, orchestration). P5 breaks it into focused collaborators, one safe slice at a time under the e2e gate. This first slice extracts the cheapest, most self-contained cluster: the run-level preflight checks.Three fail-fast guards run once per ingest, before any lock/DB/row work:
check_src_pathSRC_PATH(#772 P2)check_csv_encodingcheck_intentintentthat silently skips every row (#234)They were methods on
BaseIngestorbut are stateless validations of a single input each — so they move to a focusedingestors/preflight.pymodule of functions (not a stateful class).BaseIngestorcalls them directly (preflight.check_src_path(...)etc.); no delegate stubs left behind. base.py drops 1180 → 1070 lines.Behaviour preservation
Moved verbatim — the error-message strings (asserted by tests) are unchanged.
Tests that exercised the checks now call the
preflightfunctions (the bypass patch intest_coverage_gaps2retargetsbase_mod.preflight). Removed the now-unusedpathlib.Pathimport from base.py.What's next in P5
This was the safe pattern-establishing slice. Remaining collaborators, each its own e2e-gated slice:
_table_lock_path/_acquire/_releaselock lifecycle.process_record/_map_unique_id._flush_batch/_process_batch.The deferred bug fixes (the
mask_idcross-layer leak, per-row tolerance #235, atomicity #227/#260) ride the slice that owns the relevant code, rather than landing standalone.🤖 Generated with Claude Code