refactor(P5c): extract the per-record transform into RecordProcessor #276

Open
LukasWodka wants to merge 1 commit into refactor/p5b-extract-table-lock from refactor/p5c-extract-record-processor

Conversation

@LukasWodka

Collaborator

Summary

Structural refactor — phase P5 (decompose the BaseIngestor god-class), slice 3. Stacked on #275 (P5b).

Extracts the per-record transform. process_record + _map_unique_id — turning a raw source row into the cleaned, DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the label after the configured label policy, data_intent, annotation, framework columns) — move verbatim into a RecordProcessor class (ingestors/record_processor.py).

Key risk-reducer: RecordProcessor's attribute names match the ingestor's (self.schema, self.label_column, …), so the method bodies are byte-for-byte unchanged.

  • BaseIngestor composes it via a _record_processor property (built from the run's column/label/intent config).
  • Its public process_record stays as a one-line delegate — the ingest loop and ~26 test call-sites use self.process_record / ing.process_record, so keeping it as the ingestor's method (delegating to the collaborator) is the right boundary.
  • _map_unique_id (no direct test callers) moves fully inside RecordProcessor.
  • Removed the now-unused pandas / Intent / TaskCategory imports from base.py.
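
The composition described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the constructor arguments, the lazy caching of the property, and the simplified bodies of `process_record` and `_map_unique_id` are all assumptions. Only the class names, the `_record_processor` property, the one-line delegate, and the `schema` / `label_column` attribute names come from the description.

```python
from __future__ import annotations

from typing import Any


class RecordProcessor:
    """Illustrative stand-in for the extracted collaborator."""

    def __init__(self, schema: list[str], label_column: str, data_intent: str) -> None:
        # Attribute names mirror the ingestor's, so moved method bodies
        # can keep referring to self.schema, self.label_column, etc.
        self.schema = schema
        self.label_column = label_column
        self.data_intent = data_intent

    def _map_unique_id(self, row: dict[str, Any]) -> str:
        # Hypothetical resolution rule; the real logic is not shown in the PR.
        return str(row["id"])

    def process_record(self, row: dict[str, Any]) -> dict[str, Any]:
        # Simplified transform: keep schema columns, then add derived fields.
        record = {col: row.get(col) for col in self.schema}
        record["data_id"] = self._map_unique_id(row)
        record["label"] = row.get(self.label_column)
        record["data_intent"] = self.data_intent
        return record


class BaseIngestor:
    def __init__(self, schema: list[str], label_column: str, data_intent: str) -> None:
        self.schema = schema
        self.label_column = label_column
        self.data_intent = data_intent
        self._processor: RecordProcessor | None = None

    @property
    def _record_processor(self) -> RecordProcessor:
        # Assumption: built lazily from the run's config and cached.
        if self._processor is None:
            self._processor = RecordProcessor(
                self.schema, self.label_column, self.data_intent
            )
        return self._processor

    def process_record(self, row: dict[str, Any]) -> dict[str, Any]:
        # One-line delegate: existing call-sites keep using the ingestor.
        return self._record_processor.process_record(row)
```

Keeping the delegate on the ingestor means callers of `ing.process_record(row)` need no changes, while the transform logic lives in one class that can be tested on its own.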

Behaviour preservation

  • Full unit suite: 1080 passed, coverage 96.8%.
  • e2e (real MySQL): 23 passed, 1 xpassed; the #247 characterization goldens (test(e2e): characterization harness pinning clean-ingest behavior per modality) are unchanged.
  • Two tests that reached into the old BaseIngestor._map_unique_id now target RecordProcessor (the label-policy hook lives there).
  • The deferred mask_id cross-layer leak is preserved exactly (process still writes mask_id for semantic_segmentation) — untangling it is a follow-up, not this behaviour-preserving slice.
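
To illustrate why tests that hooked `_map_unique_id` have to be retargeted, here is one plausible shape for such a test. It is a sketch under assumptions: the stub classes, the test name, and the use of `unittest.mock.patch.object` are illustrative and not taken from the repository's test suite.

```python
from unittest.mock import patch


class RecordProcessor:
    """Stub collaborator; the real method bodies are not shown in the PR."""

    def _map_unique_id(self, row):
        return str(row["id"])

    def process_record(self, row):
        return {"data_id": self._map_unique_id(row)}


class BaseIngestor:
    """Stub ingestor that only delegates, as described in the summary."""

    def __init__(self):
        self._record_processor = RecordProcessor()

    def process_record(self, row):
        return self._record_processor.process_record(row)


def test_hook_is_patched_on_record_processor():
    ing = BaseIngestor()
    # The hook now lives on RecordProcessor, so that is the patch target.
    with patch.object(
        RecordProcessor, "_map_unique_id", return_value="patched"
    ) as hook:
        out = ing.process_record({"id": 1})
    assert out == {"data_id": "patched"}
    hook.assert_called_once()
    # The ingestor no longer owns the method, so patching it there would fail.
    assert not hasattr(BaseIngestor, "_map_unique_id")
```

The public `process_record` call-site is unchanged in this sketch; only the patch target moves from the ingestor to the collaborator.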

base.py: 1180 → 774 lines across P5a+P5b+P5c.

What's next in P5

P5d (the batch / DB / API write path).

🤖 Generated with Claude Code

Structural refactor phase P5 (backend#796), god-class decomposition slice 3.

process_record + _map_unique_id — turning a raw source row into the cleaned,
DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the
label after the configured label policy, data_intent, annotation, framework
columns) — move verbatim into a RecordProcessor class
(ingestors/record_processor.py). The attribute names match the ingestor's so
the bodies are byte-for-byte unchanged.

BaseIngestor composes it via a `_record_processor` property (built from the
run's column / label / intent config). Its public `process_record` stays as a
one-line delegate, since the ingest loop and ~26 test call-sites use
`self.process_record` / `ing.process_record`; _map_unique_id (no direct test
callers) moves fully inside RecordProcessor. Removed the now-unused pandas /
Intent / TaskCategory imports from base.py.

Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real
MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Two
tests that reached into the old BaseIngestor._map_unique_id now target
RecordProcessor (the label-policy hook lives there). The deferred mask_id
cross-layer leak is preserved exactly (process still writes mask_id for
semantic_segmentation) — untangling it is a follow-up, not this slice.

base.py: 1180 -> 774 lines across P5a+P5b+P5c.

Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@LukasWodka

Collaborator Author

👋 Heads-up — Code review queue is at 12 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)
