refactor(P5c): extract the per-record transform into RecordProcessor#276
Open
LukasWodka wants to merge 1 commit into
Open
refactor(P5c): extract the per-record transform into RecordProcessor#276LukasWodka wants to merge 1 commit into
LukasWodka wants to merge 1 commit into
Conversation
Structural refactor phase P5 (backend#796), god-class decomposition slice 3. process_record + _map_unique_id — turning a raw source row into the cleaned, DB-ready dict (schema-filtered + NA-normalised columns, resolved data_id, the label after the configured label policy, data_intent, annotation, framework columns) — move verbatim into a RecordProcessor class (ingestors/record_processor.py). The attribute names match the ingestor's so the bodies are byte-for-byte unchanged. BaseIngestor composes it via a `_record_processor` property (built from the run's column / label / intent config). Its public `process_record` stays as a one-line delegate, since the ingest loop and ~26 test call-sites use `self.process_record` / `ing.process_record`; _map_unique_id (no direct test callers) moves fully inside RecordProcessor. Removed the now-unused pandas / Intent / TaskCategory imports from base.py. Behaviour-preserving: full unit suite 1080 passed, 96.8% coverage; e2e (real MySQL) 23 passed, 1 xpassed — #247 characterization goldens unchanged. Two tests that reached into the old BaseIngestor._map_unique_id now target RecordProcessor (the label-policy hook lives there). The deferred mask_id cross-layer leak is preserved exactly (process still writes mask_id for semantic_segmentation) — untangling it is a follow-up, not this slice. base.py: 1180 -> 774 lines across P5a+P5b+P5c. Stacked on #275 (P5b). Next: P5d (the batch / DB / API write path). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
👋 Heads-up — Code review queue is at 12 / 8 Above the WIP limit. The team convention is to review existing PRs before opening new work. Open PRs currently in Code review (oldest first):
Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.) |
This was referenced Jun 16, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Structural refactor — phase P5 (decompose the
BaseIngestorgod-class), slice 3. Stacked on #275 (P5b).Extracts the per-record transform.
process_record+_map_unique_id— turning a raw source row into the cleaned, DB-ready dict (schema-filtered + NA-normalised columns, resolveddata_id, the label after the configured label policy,data_intent, annotation, framework columns) — move verbatim into aRecordProcessorclass (ingestors/record_processor.py).Key risk-reducer: RecordProcessor's attribute names match the ingestor's (
self.schema,self.label_column, …), so the method bodies are byte-for-byte unchanged.BaseIngestorcomposes it via a_record_processorproperty (built from the run's column/label/intent config).process_recordstays as a one-line delegate — the ingest loop and ~26 test call-sites useself.process_record/ing.process_record, so keeping it as the ingestor's method (delegating to the collaborator) is the right boundary._map_unique_id(no direct test callers) moves fully inside RecordProcessor.pandas/Intent/TaskCategoryimports from base.py.Behaviour preservation
BaseIngestor._map_unique_idnow targetRecordProcessor(the label-policy hook lives there).mask_idcross-layer leak is preserved exactly (processstill writesmask_idforsemantic_segmentation) — untangling it is a follow-up, not this behaviour-preserving slice.base.py: 1180 → 774 lines across P5a+P5b+P5c.
What's next in P5
_flush_batch/_process_batch), the last big cluster. It carries the deferred atomicity fixes (Atomicity gap: committed MySQL rows can outlive a failed dataset registration (from backend#772 P0.2) #227/bug: validator-rejected ingest leaves an orphaned table; no rollback blocks the next ingest #260) and per-row tolerance (bug: CSV aborts the whole ingest on one bad cell while JSON silently skips the row — make the policy consistent (follow-up to #189) #235), which land as separate follow-ups on the clean structure.🤖 Generated with Claude Code