
feat: introduce BaseExternalParser protocol for pluggable OCR/VLM backends #3207

Open
gavin913-lss wants to merge 7 commits into HKUDS:main from gavin913-lss:feat/base-external-parser-protocol

@gavin913-lss

Problem

Every new OCR/VLM parser engine copies the same four-file structure and adds another if engine == … branch in the pipeline. No shared contract exists between engines.

Fixes #3197

Solution

1. BaseExternalParser protocol (_base.py)

Abstract base class with three methods:

  • is_bundle_valid(raw_dir, source_path) — cache-hit check
  • download_into(raw_dir, source_path) — fetch raw bundle from external service
  • build_ir(raw_dir, document_name) — convert raw bundle to IRDoc
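
As a rough illustration, the contract above could be written with Python's `abc` module as follows. This is a minimal sketch, not the code in `_base.py`: the type hints, the return types, and the choice of synchronous methods are assumptions, and `IRDoc` is stood in for by `Any`. The `Incomplete` class is hypothetical and only shows that a subclass missing any of the three methods cannot be instantiated.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any


class BaseExternalParser(ABC):
    """Sketch of the shared contract; signatures are assumed, not copied from the PR."""

    @abstractmethod
    def is_bundle_valid(self, raw_dir: Path, source_path: Path) -> bool:
        """Cache-hit check: True when raw_dir already holds a usable bundle."""

    @abstractmethod
    def download_into(self, raw_dir: Path, source_path: Path) -> None:
        """Fetch the raw bundle for source_path from the external service."""

    @abstractmethod
    def build_ir(self, raw_dir: Path, document_name: str) -> Any:
        """Convert the raw bundle to an IRDoc (typed as Any in this sketch)."""


class Incomplete(BaseExternalParser):
    """Hypothetical engine that overrides only one method, so it stays abstract."""

    def is_bundle_valid(self, raw_dir: Path, source_path: Path) -> bool:
        return False


# Instantiating a class with unimplemented abstract methods raises TypeError.
try:
    Incomplete()
    instantiated = True
except TypeError:
    instantiated = False
```

An ABC rejects an incomplete engine at construction time rather than at the first call into the pipeline; a `typing.Protocol` would be an alternative if structural typing were preferred.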

2. Engine registry (_registry.py)

Dict mapping engine names to parser classes:

  • register_parser("mineru", MinerUParser)
  • register_parser("docling", DoclingParser)

New engines register here instead of adding pipeline branches.
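
A registry of this kind can be sketched in a few lines. Only `register_parser` and the two engine names come from the description above; the `get_parser` lookup helper, the duplicate-name check, and the empty placeholder classes are assumptions made for illustration and may not match `_registry.py`.

```python
from typing import Dict

# Module-level mapping from engine name to parser class.
_PARSERS: Dict[str, type] = {}


def register_parser(name: str, parser_cls: type) -> None:
    """Register parser_cls under name; refusing silent overwrites is an assumption."""
    existing = _PARSERS.get(name)
    if existing is not None and existing is not parser_cls:
        raise ValueError(f"engine {name!r} is already registered")
    _PARSERS[name] = parser_cls


def get_parser(name: str) -> type:
    """Hypothetical lookup helper: return the class registered for name."""
    try:
        return _PARSERS[name]
    except KeyError:
        known = ", ".join(sorted(_PARSERS)) or "none"
        raise KeyError(f"unknown engine {name!r} (registered: {known})") from None


class MinerUParser:
    """Placeholder standing in for the real adapter class."""


class DoclingParser:
    """Placeholder standing in for the real adapter class."""


register_parser("mineru", MinerUParser)
register_parser("docling", DoclingParser)
```

With this shape, adding an engine is one `register_parser` call next to its adapter, and an unknown engine name fails with a message that lists the registered ones.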

3. MinerU adapter (mineru/adapter.py)

Wraps existing MinerURawClient + MinerUIRBuilder + is_bundle_valid behind the protocol.

4. Docling adapter (docling/adapter.py)

Wraps existing DoclingRawClient + DoclingIRBuilder + is_bundle_valid behind the protocol.
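
Both adapters follow the same delegation pattern, which can be sketched as below. The real `MinerURawClient`, `DoclingRawClient`, and IR builder signatures are not shown in this description, so the sketch uses hypothetical stand-ins (`StubRawClient`, `StubIRBuilder`, `stub_is_bundle_valid`) with invented method names and a made-up `content.json` bundle file.

```python
import tempfile
from pathlib import Path


class StubRawClient:
    """Hypothetical stand-in for an engine's raw client."""

    def download(self, source_path: Path, raw_dir: Path) -> None:
        raw_dir.mkdir(parents=True, exist_ok=True)
        (raw_dir / "content.json").write_text('{"pages": 1}', encoding="utf-8")


class StubIRBuilder:
    """Hypothetical stand-in for an engine's IR builder."""

    def build(self, raw_dir: Path, document_name: str) -> dict:
        raw = (raw_dir / "content.json").read_text(encoding="utf-8")
        return {"name": document_name, "raw": raw}


def stub_is_bundle_valid(raw_dir: Path, source_path: Path) -> bool:
    """Hypothetical cache check: the bundle file exists."""
    return (raw_dir / "content.json").is_file()


class StubEngineParser:
    """Adapter exposing the three protocol methods by delegating to existing parts."""

    def __init__(self) -> None:
        self._client = StubRawClient()
        self._builder = StubIRBuilder()

    def is_bundle_valid(self, raw_dir: Path, source_path: Path) -> bool:
        return stub_is_bundle_valid(raw_dir, source_path)

    def download_into(self, raw_dir: Path, source_path: Path) -> None:
        self._client.download(source_path, raw_dir)

    def build_ir(self, raw_dir: Path, document_name: str) -> dict:
        return self._builder.build(raw_dir, document_name)


# Exercise the adapter against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir, source = Path(tmp) / "raw", Path(tmp) / "report.pdf"
    parser = StubEngineParser()
    hit_before = parser.is_bundle_valid(raw_dir, source)
    parser.download_into(raw_dir, source)
    hit_after = parser.is_bundle_valid(raw_dir, source)
    ir = parser.build_ir(raw_dir, "report")
```

Because the adapter only forwards calls, the wrapped client, builder, and validity check stay untouched, which is consistent with the claim that existing behavior does not change.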

What this does NOT change

  • No behavior change for existing MinerU/Docling usage
  • No changes to on-disk cache format or IRDoc schema
  • No changes to pipeline.py dispatch logic (that's a follow-up)
  • The existing parse_mineru/parse_docling methods continue to work

Next steps (follow-up PRs)

  1. Add a generic parse_external method to the pipeline that uses the registry
  2. New engines (PaddleOCR-VL, DeepSeek-OCR, etc.) implement the protocol
  3. Eventually remove the if engine == … chain from the pipeline
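
The first follow-up step might look roughly like the sketch below. None of this is in the PR: the `parse_external` signature, the plain-dict `REGISTRY`, the use of `source_path.stem` as the document name, and the `CountingParser` engine are all assumptions chosen to show the control flow (check the cache, download on a miss, then build the IR).

```python
import tempfile
from pathlib import Path


class CountingParser:
    """Hypothetical engine that counts downloads to make cache hits visible."""

    downloads = 0

    def is_bundle_valid(self, raw_dir: Path, source_path: Path) -> bool:
        return (raw_dir / "bundle.done").is_file()

    def download_into(self, raw_dir: Path, source_path: Path) -> None:
        raw_dir.mkdir(parents=True, exist_ok=True)
        (raw_dir / "bundle.done").write_text("ok", encoding="utf-8")
        CountingParser.downloads += 1

    def build_ir(self, raw_dir: Path, document_name: str) -> dict:
        return {"name": document_name}


# Stand-in for the engine registry; a plain dict keeps the sketch self-contained.
REGISTRY = {"counting": CountingParser}


def parse_external(engine: str, source_path: Path, raw_dir: Path, registry=REGISTRY):
    """Generic dispatch: one code path for every registered engine."""
    if engine not in registry:
        raise ValueError(f"unknown engine: {engine!r}")
    parser = registry[engine]()
    if not parser.is_bundle_valid(raw_dir, source_path):
        parser.download_into(raw_dir, source_path)  # cache miss
    return parser.build_ir(raw_dir, source_path.stem)


with tempfile.TemporaryDirectory() as tmp:
    source, raw_dir = Path(tmp) / "report.pdf", Path(tmp) / "raw"
    first = parse_external("counting", source, raw_dir)
    second = parse_external("counting", source, raw_dir)  # served from cache
    download_count = CountingParser.downloads
```

In this sketch the second call skips `download_into` because the bundle is already valid, and the per-engine branch is replaced by a single registry lookup.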

Development

Successfully merging this pull request may close these issues.

RFC: introduce a BaseExternalParser protocol for pluggable OCR/VLM backends
