Skip to content

refactor(P5b): extract the cross-ingest table lock into TableLock#275

Merged
divyasinghds merged 1 commit into
developfrom
refactor/p5b-extract-table-lock
Jun 17, 2026
Merged

refactor(P5b): extract the cross-ingest table lock into TableLock#275
divyasinghds merged 1 commit into
developfrom
refactor/p5b-extract-table-lock

Conversation

@LukasWodka

Copy link
Copy Markdown
Collaborator

Summary

Structural refactor — phase P5 (decompose the BaseIngestor god-class), slice 2. P5a (#274) is merged, so this targets develop directly.

Extracts the cross-ingest table lock (#772 P2). The lock lifecycle (compute path under STORAGE_PATH, atomic O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release) was three methods + a constant on BaseIngestor. It's a stateful, self-contained responsibility, so it moves verbatim (every error message unchanged) into a TableLock class (ingestors/table_lock.py) that BaseIngestor composes via a _table_lock property.

_ingest_with_lock now reads:

lock = self._table_lock
_lock_path = lock.acquire()
try:
    return self._ingest_with_lock(source, batch_size)
finally:
    lock.release(_lock_path)

Behaviour preservation

  • STORAGE_PATH is a Config class constant (not env/override driven), so TableLock reads it from a plain Config() exactly as the old code did — the ~10 lock tests that patch.object(Config, "STORAGE_PATH", …) keep working unchanged. They now exercise the lock via ing._table_lock.acquire() / .release() / .path() (a clean rename).
  • Removed the now-unused os and ..config.Config imports from base.py (their last users moved out in P5a/P5b).

Verification:

base.py: 1180 → 939 lines across P5a+P5b.

What's next in P5

  • P5cRecordProcessor (process_record / _map_unique_id).
  • P5d — the batch/DB/API write path (_flush_batch / _process_batch).

Each behaviour-preserving. The deferred bug fixes (mask_id cross-layer leak, per-row tolerance #235, atomicity #227/#260) ride later focused slices on top of the clean structure — keeping a god-class teardown out of the same diff as subtle correctness changes.

🤖 Generated with Claude Code

Structural refactor phase P5 (backend#796), god-class decomposition slice 2.

The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic
O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was
three methods + a constant on BaseIngestor. It's a cohesive, stateful
responsibility, so it moves verbatim (every error message unchanged) into a
`TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a
`_table_lock` property. _ingest_with_lock now does
`lock = self._table_lock; lock.acquire()` ... `lock.release(path)`.

STORAGE_PATH is a Config class constant (not env/override driven), so TableLock
reads it from a plain Config() exactly as before — the ~10 lock tests that
patch Config.STORAGE_PATH keep working; they now exercise the lock via
`ing._table_lock.acquire()` etc. Removed the now-unused `os` and
`..config.Config` imports from base.py (their last users moved out in P5a/P5b).

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged.

Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB /
API write path), each behaviour-preserving; the deferred bug fixes (mask_id
leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@LukasWodka

Copy link
Copy Markdown
Collaborator Author

👋 Heads-up — Code review queue is at 11 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

@divyasinghds divyasinghds merged commit 1028997 into develop Jun 17, 2026
6 checks passed
@divyasinghds divyasinghds deleted the refactor/p5b-extract-table-lock branch June 17, 2026 05:46
divyasinghds added a commit that referenced this pull request Jun 17, 2026
…276)

* refactor(P5b): extract the cross-ingest table lock into TableLock

Structural refactor phase P5 (backend#796), god-class decomposition slice 2.

The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic
O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was
three methods + a constant on BaseIngestor. It's a cohesive, stateful
responsibility, so it moves verbatim (every error message unchanged) into a
`TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a
`_table_lock` property. _ingest_with_lock now does
`lock = self._table_lock; lock.acquire()` ... `lock.release(path)`.

STORAGE_PATH is a Config class constant (not env/override driven), so TableLock
reads it from a plain Config() exactly as before — the ~10 lock tests that
patch Config.STORAGE_PATH keep working; they now exercise the lock via
`ing._table_lock.acquire()` etc. Removed the now-unused `os` and
`..config.Config` imports from base.py (their last users moved out in P5a/P5b).

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged.

Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB /
API write path), each behaviour-preserving; the deferred bug fixes (mask_id
leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(P5c): extract the per-record transform into RecordProcessor

Structural refactor phase P5 (backend#796), god-class decomposition slice 3.

process_record + _map_unique_id — turning a raw source row into the cleaned,
DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the
label after the configured label policy, data_intent, annotation, framework
columns) — move verbatim into a RecordProcessor class
(ingestors/record_processor.py). The attribute names match the ingestor's so
the bodies are byte-for-byte unchanged.

BaseIngestor composes it via a `_record_processor` property (built from the
run's column / label / intent config). Its public `process_record` stays as a
one-line delegate, since the ingest loop and ~26 test call-sites use
`self.process_record` / `ing.process_record`; _map_unique_id (no direct test
callers) moves fully inside RecordProcessor. Removed the now-unused pandas /
Intent / TaskCategory imports from base.py.

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Two
tests that reached into the old BaseIngestor._map_unique_id now target
RecordProcessor (the label-policy hook lives there). The deferred mask_id
cross-layer leak is preserved exactly (process still writes mask_id for
semantic_segmentation) — untangling it is a follow-up, not this slice.

base.py: 1180 -> 774 lines across P5a+P5b+P5c.

Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Divya <divyasingh@tracebloc.io>
divyasinghds added a commit that referenced this pull request Jun 17, 2026
* refactor(P5b): extract the cross-ingest table lock into TableLock

Structural refactor phase P5 (backend#796), god-class decomposition slice 2.

The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic
O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was
three methods + a constant on BaseIngestor. It's a cohesive, stateful
responsibility, so it moves verbatim (every error message unchanged) into a
`TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a
`_table_lock` property. _ingest_with_lock now does
`lock = self._table_lock; lock.acquire()` ... `lock.release(path)`.

STORAGE_PATH is a Config class constant (not env/override driven), so TableLock
reads it from a plain Config() exactly as before — the ~10 lock tests that
patch Config.STORAGE_PATH keep working; they now exercise the lock via
`ing._table_lock.acquire()` etc. Removed the now-unused `os` and
`..config.Config` imports from base.py (their last users moved out in P5a/P5b).

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged.

Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB /
API write path), each behaviour-preserving; the deferred bug fixes (mask_id
leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(P5c): extract the per-record transform into RecordProcessor

Structural refactor phase P5 (backend#796), god-class decomposition slice 3.

process_record + _map_unique_id — turning a raw source row into the cleaned,
DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the
label after the configured label policy, data_intent, annotation, framework
columns) — move verbatim into a RecordProcessor class
(ingestors/record_processor.py). The attribute names match the ingestor's so
the bodies are byte-for-byte unchanged.

BaseIngestor composes it via a `_record_processor` property (built from the
run's column / label / intent config). Its public `process_record` stays as a
one-line delegate, since the ingest loop and ~26 test call-sites use
`self.process_record` / `ing.process_record`; _map_unique_id (no direct test
callers) moves fully inside RecordProcessor. Removed the now-unused pandas /
Intent / TaskCategory imports from base.py.

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Two
tests that reached into the old BaseIngestor._map_unique_id now target
RecordProcessor (the label-policy hook lives there). The deferred mask_id
cross-layer leak is preserved exactly (process still writes mask_id for
semantic_segmentation) — untangling it is a follow-up, not this slice.

base.py: 1180 -> 774 lines across P5a+P5b+P5c.

Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(P5d): extract the batch write path into BatchWriter

Structural refactor phase P5 (backend#796), god-class decomposition slice 4 —
the final extraction.

_flush_batch + _process_batch — insert a batch to MySQL, publish the inserted
rows to the backend, and fold the outcome (inserted / api-sent / failed) into
the run's stats + failed_records — move verbatim into a BatchWriter class
(ingestors/batch_writer.py). The subtle mid-batch-DB-failure accounting (match
inserted-but-unsent records by data_id, not position) and the #99 /
api_send_failed surfacing are byte-for-byte unchanged. Attribute names match
the ingestor's so the bodies are identical.

BaseIngestor composes it via a `_batch_writer` property; _ingest_with_lock's
two flush sites call self._batch_writer.flush(...). The unused `session`
parameter is threaded through unchanged for parity (a later cleanup can drop
it). Tests that drove _process_batch directly now call
ing._batch_writer._process(...).

Behaviour-preserving: full unit suite 1080 passed, 96.9% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged.

Completes the P5 decomposition: BaseIngestor (now 643 lines, was 1180) is a thin
orchestrator composing Preflight (P5a) + TableLock (P5b) + RecordProcessor (P5c)
+ BatchWriter (P5d), keeping only ingest()/_ingest_with_lock orchestration +
validate_data + create_table/dataset registration. The deferred correctness
fixes (mask_id leak, per-row tolerance #235, atomicity #227/#260) now have clean
seams to land on as separate reviewed PRs.

Stacked on #276 (P5c).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Divya <divyasingh@tracebloc.io>
divyasinghds added a commit that referenced this pull request Jun 17, 2026
)

* refactor(P5b): extract the cross-ingest table lock into TableLock

Structural refactor phase P5 (backend#796), god-class decomposition slice 2.

The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic
O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was
three methods + a constant on BaseIngestor. It's a cohesive, stateful
responsibility, so it moves verbatim (every error message unchanged) into a
`TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a
`_table_lock` property. _ingest_with_lock now does
`lock = self._table_lock; lock.acquire()` ... `lock.release(path)`.

STORAGE_PATH is a Config class constant (not env/override driven), so TableLock
reads it from a plain Config() exactly as before — the ~10 lock tests that
patch Config.STORAGE_PATH keep working; they now exercise the lock via
`ing._table_lock.acquire()` etc. Removed the now-unused `os` and
`..config.Config` imports from base.py (their last users moved out in P5a/P5b).

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged.

Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB /
API write path), each behaviour-preserving; the deferred bug fixes (mask_id
leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(P5c): extract the per-record transform into RecordProcessor

Structural refactor phase P5 (backend#796), god-class decomposition slice 3.

process_record + _map_unique_id — turning a raw source row into the cleaned,
DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the
label after the configured label policy, data_intent, annotation, framework
columns) — move verbatim into a RecordProcessor class
(ingestors/record_processor.py). The attribute names match the ingestor's so
the bodies are byte-for-byte unchanged.

BaseIngestor composes it via a `_record_processor` property (built from the
run's column / label / intent config). Its public `process_record` stays as a
one-line delegate, since the ingest loop and ~26 test call-sites use
`self.process_record` / `ing.process_record`; _map_unique_id (no direct test
callers) moves fully inside RecordProcessor. Removed the now-unused pandas /
Intent / TaskCategory imports from base.py.

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Two
tests that reached into the old BaseIngestor._map_unique_id now target
RecordProcessor (the label-policy hook lives there). The deferred mask_id
cross-layer leak is preserved exactly (process still writes mask_id for
semantic_segmentation) — untangling it is a follow-up, not this slice.

base.py: 1180 -> 774 lines across P5a+P5b+P5c.

Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(P5d): extract the batch write path into BatchWriter

Structural refactor phase P5 (backend#796), god-class decomposition slice 4 —
the final extraction.

_flush_batch + _process_batch — insert a batch to MySQL, publish the inserted
rows to the backend, and fold the outcome (inserted / api-sent / failed) into
the run's stats + failed_records — move verbatim into a BatchWriter class
(ingestors/batch_writer.py). The subtle mid-batch-DB-failure accounting (match
inserted-but-unsent records by data_id, not position) and the #99 /
api_send_failed surfacing are byte-for-byte unchanged. Attribute names match
the ingestor's so the bodies are identical.

BaseIngestor composes it via a `_batch_writer` property; _ingest_with_lock's
two flush sites call self._batch_writer.flush(...). The unused `session`
parameter is threaded through unchanged for parity (a later cleanup can drop
it). Tests that drove _process_batch directly now call
ing._batch_writer._process(...).

Behaviour-preserving: full unit suite 1080 passed, 96.9% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged.

Completes the P5 decomposition: BaseIngestor (now 643 lines, was 1180) is a thin
orchestrator composing Preflight (P5a) + TableLock (P5b) + RecordProcessor (P5c)
+ BatchWriter (P5d), keeping only ingest()/_ingest_with_lock orchestration +
validate_data + create_table/dataset registration. The deferred correctness
fixes (mask_id leak, per-row tolerance #235, atomicity #227/#260) now have clean
seams to land on as separate reviewed PRs.

Stacked on #276 (P5c).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(P5): move mask_id off the DB-bound record (sidecar split)

Completes the P5 decomposition's last cross-layer loose end. semantic_
segmentation's mask_id is a per-row pointer to a mask FILE, not a table column
(there's no mask_id column on the standard table — #212). It used to ride the
cleaned, DB-bound record across three layers: RecordProcessor SET it,
map_file_transfer READ it, and BatchWriter POPPED it before insert — a fragile
indirection where forgetting the pop broke inserts on non-semseg tables.

Now the sidecar pointer stays on the RAW source record and map_file_transfer
owns its lifecycle: it LENDS mask_id from the raw record to the transfer for
the duration of the copy, then STRIPS it before return (in finally — even on a
failed / None transfer). So:
- RecordProcessor no longer writes mask_id onto the cleaned record (and no
  longer needs `category` at all — dropped).
- BatchWriter no longer pops it (the cleaned record never carries it).
- BaseIngestor.ingest passes the raw record as
  map_file_transfer(..., source_record=record).

The cleaned record now holds ONLY table + framework columns — a runtime-only
pointer can't reach the DB insert by construction (the #212 class of bug is
now unrepresentable, not just guarded).

Behaviour-preserving: full unit suite 1080 passed, 96.9% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — the semantic_segmentation template still ingests
(its mask transfers via the lent pointer; #247 goldens unchanged). Tests that
asserted mask_id on the cleaned record / BatchWriter's pop are updated; a new
test pins map_file_transfer's lend-then-strip.

Stacked on #278 (P5d).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(P5 mask_id): exclude sidecar keys from the cleaned record even when in schema

Audit follow-up — self-review caught a behaviour change the first cut smuggled
in. It assumed the cleaned record never carries mask_id once RecordProcessor
stops adding it, but the shipped semantic_segmentation TEMPLATE lists mask_id in
its schema (schema={"mask_id": "VARCHAR(255)"}). So RecordProcessor's schema
comprehension KEPT it, map_file_transfer's `key not in record` guard then
skipped the strip, and mask_id reached insert_batch — flipping that DB column
from NULL (develop: unconditional pop) to populated. The e2e missed it because
the harness's semseg case is schema-less.

Fix (restores develop's behaviour AND genuinely removes the leak):
- SIDECAR_KEYS moves to utils/constants (shared). RecordProcessor excludes it
  from the cleaned record EVEN when listed in schema, so mask_id is never
  DB-bound regardless of how a template declares it.
- map_file_transfer strips the FULL SIDECAR_KEYS set in `finally` (not only what
  it lent) — defense-in-depth so a sidecar pointer can never bind as a column.
- BatchWriter / RecordProcessor docstrings corrected (the old comment asserted a
  now-false invariant).

Tests: + test_process_record_excludes_mask_id_even_when_in_schema (the leak path)
and + test_map_file_transfer_strips_preexisting_sidecar_key (the strip-all
defense). Full suite 1082 passed, 96.9%; e2e 23 passed, 1 xpassed (schema-less
semseg still transfers via the lend).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Revert "fix(P5 mask_id): exclude sidecar keys from the cleaned record even when in schema"

This reverts commit cc50bc3.

* fix(P5 mask_id): store mask_id when the schema declares it — it's the client contract

Audit conclusion (backend#816): the prior "B1 fix" (reverted in the parent
commit) was itself the regression, and so was develop's blanket pop. The
training client SELECTs the dataset row from cluster MySQL and does
str(row["mask_id"]) to locate each mask file, raising FileNotFoundError if
mask_id is missing/NULL (tracebloc-client segmentation_dataset_pytorch.py:98).
So semseg REQUIRES mask_id stored — develop NULLed it; excluding it (B1) kept
it NULL; both break semseg training.

Correct behaviour (restored by the revert): mask_id is KEPT and stored when the
template DECLARES it in schema (a real column the client reads); it is lent from
the raw record + stripped for the transfer ONLY when not declared (no column).

This commit:
- Corrects the now-stale docstrings (record_processor / file_transfer
  map_file_transfer / transfer / batch_writer) that wrongly claimed mask_id is
  "never on the cleaned record" — the framing that caused the misdiagnosis.
- Adds test_process_record_stores_mask_id_when_declared_in_schema (the contract
  guard that was missing) and reframes the schema-less complement test.

Full suite 1081 passed, 96.9%; e2e 23 passed, 1 xpassed. The end-to-end
ingestor->client contract is tracked for DS sign-off in backend#816.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Divya <divyasingh@tracebloc.io>
divyasinghds added a commit that referenced this pull request Jun 17, 2026
…cterization, _refresh_token (#282)

* refactor(P5b): extract the cross-ingest table lock into TableLock

Structural refactor phase P5 (backend#796), god-class decomposition slice 2.

The file-based table lock (#772 P2) — compute path under STORAGE_PATH, atomic
O_EXCL acquire with stale-reclaim + corrupt-lock mtime fallback, release — was
three methods + a constant on BaseIngestor. It's a cohesive, stateful
responsibility, so it moves verbatim (every error message unchanged) into a
`TableLock` class (ingestors/table_lock.py) that BaseIngestor composes via a
`_table_lock` property. _ingest_with_lock now does
`lock = self._table_lock; lock.acquire()` ... `lock.release(path)`.

STORAGE_PATH is a Config class constant (not env/override driven), so TableLock
reads it from a plain Config() exactly as before — the ~10 lock tests that
patch Config.STORAGE_PATH keep working; they now exercise the lock via
`ing._table_lock.acquire()` etc. Removed the now-unused `os` and
`..config.Config` imports from base.py (their last users moved out in P5a/P5b).

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged.

Stacked on #274 (P5a). Next: P5c (RecordProcessor) and P5d (the batch / DB /
API write path), each behaviour-preserving; the deferred bug fixes (mask_id
leak, per-row tolerance #235, atomicity #227 / #260) ride later focused slices.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(P5c): extract the per-record transform into RecordProcessor

Structural refactor phase P5 (backend#796), god-class decomposition slice 3.

process_record + _map_unique_id — turning a raw source row into the cleaned,
DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the
label after the configured label policy, data_intent, annotation, framework
columns) — move verbatim into a RecordProcessor class
(ingestors/record_processor.py). The attribute names match the ingestor's so
the bodies are byte-for-byte unchanged.

BaseIngestor composes it via a `_record_processor` property (built from the
run's column / label / intent config). Its public `process_record` stays as a
one-line delegate, since the ingest loop and ~26 test call-sites use
`self.process_record` / `ing.process_record`; _map_unique_id (no direct test
callers) moves fully inside RecordProcessor. Removed the now-unused pandas /
Intent / TaskCategory imports from base.py.

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Two
tests that reached into the old BaseIngestor._map_unique_id now target
RecordProcessor (the label-policy hook lives there). The deferred mask_id
cross-layer leak is preserved exactly (process still writes mask_id for
semantic_segmentation) — untangling it is a follow-up, not this slice.

base.py: 1180 -> 774 lines across P5a+P5b+P5c.

Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(P5d): extract the batch write path into BatchWriter

Structural refactor phase P5 (backend#796), god-class decomposition slice 4 —
the final extraction.

_flush_batch + _process_batch — insert a batch to MySQL, publish the inserted
rows to the backend, and fold the outcome (inserted / api-sent / failed) into
the run's stats + failed_records — move verbatim into a BatchWriter class
(ingestors/batch_writer.py). The subtle mid-batch-DB-failure accounting (match
inserted-but-unsent records by data_id, not position) and the #99 /
api_send_failed surfacing are byte-for-byte unchanged. Attribute names match
the ingestor's so the bodies are identical.

BaseIngestor composes it via a `_batch_writer` property; _ingest_with_lock's
two flush sites call self._batch_writer.flush(...). The unused `session`
parameter is threaded through unchanged for parity (a later cleanup can drop
it). Tests that drove _process_batch directly now call
ing._batch_writer._process(...).

Behaviour-preserving: full unit suite 1080 passed, 96.9% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged.

Completes the P5 decomposition: BaseIngestor (now 643 lines, was 1180) is a thin
orchestrator composing Preflight (P5a) + TableLock (P5b) + RecordProcessor (P5c)
+ BatchWriter (P5d), keeping only ingest()/_ingest_with_lock orchestration +
validate_data + create_table/dataset registration. The deferred correctness
fixes (mask_id leak, per-row tolerance #235, atomicity #227/#260) now have clean
seams to land on as separate reviewed PRs.

Stacked on #276 (P5c).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(P5): move mask_id off the DB-bound record (sidecar split)

Completes the P5 decomposition's last cross-layer loose end. semantic_
segmentation's mask_id is a per-row pointer to a mask FILE, not a table column
(there's no mask_id column on the standard table — #212). It used to ride the
cleaned, DB-bound record across three layers: RecordProcessor SET it,
map_file_transfer READ it, and BatchWriter POPPED it before insert — a fragile
indirection where forgetting the pop broke inserts on non-semseg tables.

Now the sidecar pointer stays on the RAW source record and map_file_transfer
owns its lifecycle: it LENDS mask_id from the raw record to the transfer for
the duration of the copy, then STRIPS it before return (in finally — even on a
failed / None transfer). So:
- RecordProcessor no longer writes mask_id onto the cleaned record (and no
  longer needs `category` at all — dropped).
- BatchWriter no longer pops it (the cleaned record never carries it).
- BaseIngestor.ingest passes the raw record as
  map_file_transfer(..., source_record=record).

The cleaned record now holds ONLY table + framework columns — a runtime-only
pointer can't reach the DB insert by construction (the #212 class of bug is
now unrepresentable, not just guarded).

Behaviour-preserving: full unit suite 1080 passed, 96.9% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — the semantic_segmentation template still ingests
(its mask transfers via the lent pointer; #247 goldens unchanged). Tests that
asserted mask_id on the cleaned record / BatchWriter's pop are updated; a new
test pins map_file_transfer's lend-then-strip.

Stacked on #278 (P5d).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(P5 mask_id): exclude sidecar keys from the cleaned record even when in schema

Audit follow-up — self-review caught a behaviour change the first cut smuggled
in. It assumed the cleaned record never carries mask_id once RecordProcessor
stops adding it, but the shipped semantic_segmentation TEMPLATE lists mask_id in
its schema (schema={"mask_id": "VARCHAR(255)"}). So RecordProcessor's schema
comprehension KEPT it, map_file_transfer's `key not in record` guard then
skipped the strip, and mask_id reached insert_batch — flipping that DB column
from NULL (develop: unconditional pop) to populated. The e2e missed it because
the harness's semseg case is schema-less.

Fix (restores develop's behaviour AND genuinely removes the leak):
- SIDECAR_KEYS moves to utils/constants (shared). RecordProcessor excludes it
  from the cleaned record EVEN when listed in schema, so mask_id is never
  DB-bound regardless of how a template declares it.
- map_file_transfer strips the FULL SIDECAR_KEYS set in `finally` (not only what
  it lent) — defense-in-depth so a sidecar pointer can never bind as a column.
- BatchWriter / RecordProcessor docstrings corrected (the old comment asserted a
  now-false invariant).

Tests: + test_process_record_excludes_mask_id_even_when_in_schema (the leak path)
and + test_map_file_transfer_strips_preexisting_sidecar_key (the strip-all
defense). Full suite 1082 passed, 96.9%; e2e 23 passed, 1 xpassed (schema-less
semseg still transfers via the lend).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Revert "fix(P5 mask_id): exclude sidecar keys from the cleaned record even when in schema"

This reverts commit cc50bc3.

* fix(P5 mask_id): store mask_id when the schema declares it — it's the client contract

Audit conclusion (backend#816): the prior "B1 fix" (reverted in the parent
commit) was itself the regression, and so was develop's blanket pop. The
training client SELECTs the dataset row from cluster MySQL and does
str(row["mask_id"]) to locate each mask file, raising FileNotFoundError if
mask_id is missing/NULL (tracebloc-client segmentation_dataset_pytorch.py:98).
So semseg REQUIRES mask_id stored — develop NULLed it; excluding it (B1) kept
it NULL; both break semseg training.

Correct behaviour (restored by the revert): mask_id is KEPT and stored when the
template DECLARES it in schema (a real column the client reads); it is lent from
the raw record + stripped for the transfer ONLY when not declared (no column).

This commit:
- Corrects the now-stale docstrings (record_processor / file_transfer
  map_file_transfer / transfer / batch_writer) that wrongly claimed mask_id is
  "never on the cleaned record" — the framing that caused the misdiagnosis.
- Adds test_process_record_stores_mask_id_when_declared_in_schema (the contract
  guard that was missing) and reframes the schema-less complement test.

Full suite 1081 passed, 96.9%; e2e 23 passed, 1 xpassed. The end-to-end
ingestor->client contract is tracked for DS sign-off in backend#816.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(coverage): close audit gaps — token_classification e2e, semseg characterization, _refresh_token, README

Closes the test-coverage gaps the audit surfaced:

- token_classification: was in ZERO e2e tests despite a bundled template — now
  ingests cleanly in test_ingest_e2e (its CSV quotes the extension as '.txt',
  exercising that path too).
- semantic_segmentation: added to the characterization harness — pins the full
  contract (masks land in DEST_PATH + mask_id stored for the client). It PASSES:
  the P5 mask_id work wires masks through the declarative path, so #136 is fixed
  and its stale test_ingest_e2e xfail(strict=False) is removed (the silent-fix
  gap the audit flagged). The cross-repo client contract is tracked in
  backend#816.
- _refresh_token: the #772 mid-run BACKEND_TOKEN rotation body was only ever
  exercised with the method STUBBED; added direct tests for the rotation path
  (env rotates -> True / unchanged -> False) and the creds re-auth path
  (authenticate -> True / raises -> False).
- e2e/README.md: corrected the stale "Known gaps" table (OD #135, MLM #137,
  semseg #136 all landed) and switched guidance to xfail(strict=True).

e2e: 26 passed (was 23 + 1 xpass). Unit: 1085 passed, 97.1% (+4 token tests;
api/client _refresh_token now covered).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Divya <divyasingh@tracebloc.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants