refactor(P5c): extract the per-record transform into RecordProcessor by LukasWodka · Pull Request #276 · tracebloc/data-ingestors

LukasWodka · 2026-06-16T12:29:42Z

Summary

Structural refactor — phase P5 (decompose the BaseIngestor god-class), slice 3. Stacked on #275 (P5b).

Extracts the per-record transform. process_record + _map_unique_id — turning a raw source row into the cleaned, DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the label after the configured label policy, data_intent, annotation, framework columns) — move verbatim into a RecordProcessor class (ingestors/record_processor.py).

Key risk-reducer: RecordProcessor's attribute names match the ingestor's (self.schema, self.label_column, …), so the method bodies are byte-for-byte unchanged.
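A minimal sketch of the shape described above, not the code in this PR. Only `self.schema`, `self.label_column`, `process_record` and `_map_unique_id` are named in the description. The remaining attributes (`unique_id_column`, `intent`), the `fallback_id` parameter, the method signatures and the NA policy are hypothetical, and the label policy, annotation and framework columns are left out. Because the collaborator reads the same `self.<name>` attributes the ingestor did, a moved method body needs no edits:

```python
import math
from typing import Any, Dict, Optional


def _is_na(value: Any) -> bool:
    """Treat None, NaN and blank strings as missing (an assumed NA policy)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


class RecordProcessor:
    """Illustrative stand-in; attribute names mirror the ingestor's."""

    def __init__(
        self,
        schema: Dict[str, str],
        label_column: Optional[str],
        unique_id_column: Optional[str],  # hypothetical attribute
        intent: str,  # hypothetical attribute
    ) -> None:
        self.schema = schema
        self.label_column = label_column
        self.unique_id_column = unique_id_column
        self.intent = intent

    def _map_unique_id(self, record: Dict[str, Any], fallback_id: str) -> str:
        # Prefer the configured id column; otherwise use the caller's fallback.
        if self.unique_id_column and not _is_na(record.get(self.unique_id_column)):
            return str(record[self.unique_id_column])
        return fallback_id

    def process_record(self, record: Dict[str, Any], fallback_id: str) -> Dict[str, Any]:
        # Keep only schema columns and normalise missing values to None.
        cleaned: Dict[str, Any] = {
            col: (None if _is_na(record.get(col)) else record[col])
            for col in self.schema
        }
        cleaned["data_id"] = self._map_unique_id(record, fallback_id)
        if self.label_column is not None:
            label = record.get(self.label_column)
            cleaned["label"] = None if _is_na(label) else label
        cleaned["data_intent"] = self.intent
        return cleaned
```

In this sketch, a row such as `{"id": 7, "age": float("nan"), "extra": 1}` with schema `{"age": "INT"}` comes back with `age` set to `None`, `extra` dropped, and `data_id` set to `"7"`.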

BaseIngestor composes it via a _record_processor property (built from the run's column/label/intent config).
Its public process_record stays as a one-line delegate — the ingest loop and ~26 test call-sites use self.process_record / ing.process_record, so keeping it as the ingestor's method (delegating to the collaborator) is the right boundary.
_map_unique_id (no direct test callers) moves fully inside RecordProcessor.
Removed the now-unused pandas / Intent / TaskCategory imports from base.py.
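The composition boundary in the list above could look roughly like the following. This is a sketch under assumptions: the `RecordProcessor` here is a stub with a placeholder transform, the constructor arguments are invented for illustration, and caching the collaborator on first access is a guess (the description only says it is a property built from the run's config).

```python
from typing import Any, Dict, Optional


class RecordProcessor:
    """Minimal stub standing in for ingestors/record_processor.py."""

    def __init__(self, schema: Dict[str, str], label_column: Optional[str]) -> None:
        self.schema = schema
        self.label_column = label_column

    def process_record(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Placeholder transform: schema filter only.
        return {col: record.get(col) for col in self.schema}


class BaseIngestor:
    def __init__(self, schema: Dict[str, str], label_column: Optional[str] = None) -> None:
        self.schema = schema
        self.label_column = label_column
        self._record_processor_cache: Optional[RecordProcessor] = None

    @property
    def _record_processor(self) -> RecordProcessor:
        # Built lazily from the run's config; the caching is an assumption.
        if self._record_processor_cache is None:
            self._record_processor_cache = RecordProcessor(self.schema, self.label_column)
        return self._record_processor_cache

    def process_record(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # One-line delegate: existing self.process_record call-sites keep working.
        return self._record_processor.process_record(record)
```

Keeping `process_record` on the ingestor means the ingest loop and existing tests do not need to know the collaborator exists; only code that reached into `_map_unique_id` has to retarget.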

Behaviour preservation

Full unit suite: 1080 passed, coverage 96.8%.
e2e (real MySQL): 23 passed, 1 xpassed; the #247 characterization goldens (test(e2e): characterization harness pinning clean-ingest behavior per modality) are unchanged.
Two tests that reached into the old BaseIngestor._map_unique_id now target RecordProcessor (the label-policy hook lives there).
The deferred mask_id cross-layer leak is preserved exactly (process still writes mask_id for semantic_segmentation) — untangling it is a follow-up, not this behaviour-preserving slice.

base.py: 1180 → 774 lines across P5a+P5b+P5c.

What's next in P5

P5d: the batch/DB/API write path (_flush_batch / _process_batch), the last big cluster. It carries two deferred pieces of work, which land as separate follow-ups on the clean structure:

- the atomicity fixes: #227 (Atomicity gap: committed MySQL rows can outlive a failed dataset registration, from backend#772 P0.2) and #260 (bug: validator-rejected ingest leaves an orphaned table; no rollback blocks the next ingest);
- per-row tolerance: #235 (bug: CSV aborts the whole ingest on one bad cell while JSON silently skips the row; make the policy consistent, follow-up to #189).

🤖 Generated with Claude Code

Structural refactor phase P5 (backend#796), god-class decomposition slice 3.

process_record + _map_unique_id, which turn a raw source row into the cleaned, DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the label after the configured label policy, data_intent, annotation, framework columns), move verbatim into a RecordProcessor class (ingestors/record_processor.py). The attribute names match the ingestor's, so the bodies are byte-for-byte unchanged.

BaseIngestor composes it via a `_record_processor` property (built from the run's column / label / intent config). Its public `process_record` stays as a one-line delegate, since the ingest loop and ~26 test call-sites use `self.process_record` / `ing.process_record`; _map_unique_id (no direct test callers) moves fully inside RecordProcessor. Removed the now-unused pandas / Intent / TaskCategory imports from base.py.

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed; #247 characterization goldens unchanged. Two tests that reached into the old BaseIngestor._map_unique_id now target RecordProcessor (the label-policy hook lives there). The deferred mask_id cross-layer leak is preserved exactly (process still writes mask_id for semantic_segmentation); untangling it is a follow-up, not this slice.

base.py: 1180 -> 774 lines across P5a+P5b+P5c. Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LukasWodka · 2026-06-16T12:30:49Z

👋 Heads-up — Code review queue is at 12 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

averaging-service#115 — Release: staging → main (averaging correctness sweep → production) · author: @aptracebloc · no reviewer assigned
backend#806 — feat(#805): store & distribute contributor tokenizer as a model artifact (Task 1 — backend) · author: @shujaatTracebloc · reviewer: @aptracebloc
backend#807 — Release: staging → master (16.06.2026) · author: @aptracebloc · no reviewer assigned
cli#78 — fix(dataset rm): delete staging files from a uid-65532 pod, not jobs-manager (#259: Release v0.3.10, ingestion hardening + path-traversal fix + single-label preflight) · author: @LukasWodka · no reviewer assigned
cli#79 — chore(schema): re-sync vendored ingest.v1.json from data-ingestors master · author: @LukasWodka · no reviewer assigned
client-runtime#108 — fix(authz): match ingest table prefixes at a segment boundary (close cross-tenant straddle) · author: @LukasWodka · no reviewer assigned
client-runtime#114 — fix(jobs): cap training Job backoffLimit to stop crashloops starving the cluster · author: @saadqbal · no reviewer assigned
data-ingestors#270 — docs(releasing): correct ingestor rollout — floating tag + imagePullPolicy=Always, not INGESTOR_IMAGE_DIGEST rewrite · author: @saadqbal · no reviewer assigned
data-ingestors#275 — refactor(P5b): extract the cross-ingest table lock into TableLock · author: @LukasWodka · no reviewer assigned
tracebloc-client#376 — fix: throttle per-batch log flooding in training & inference loops (#755) · author: @divyasinghds · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

LukasWodka wants to merge 1 commit into refactor/p5b-extract-table-lock from refactor/p5c-extract-record-processor
