diff --git a/.gitignore b/.gitignore index 15613ea8a..dc473bfb9 100644 --- a/.gitignore +++ b/.gitignore @@ -166,3 +166,9 @@ cython_debug/ src/.DS_Store .DS_Store .cursorrules + +# Local secrets (never commit) +.secrets.local +*.secrets +.env.local +test-data/ diff --git a/docs/distribution-and-publishing.md b/docs/distribution-and-publishing.md new file mode 100644 index 000000000..c2240b0db --- /dev/null +++ b/docs/distribution-and-publishing.md @@ -0,0 +1,679 @@ +# MarkItDown Distribution and Publishing Plan + +## Background + +The local fork contains two core packages: +- **markitdown** `0.1.6b2` (the latest official PyPI release is `0.1.5`) +- **markitdown-glmocr** `0.1.0` (not on PyPI; a plugin that exists only in this fork) + +Goal: let other people easily use markitdown together with the glmocr plugin, whether or not upstream merges the PR. + +--- + +## Overview of Options + +| Option | Use case | User experience | Maintenance cost | Distribution channel | +|------|---------|----------|---------|---------| +| **A. Standalone PyPI release** | Python developers | Works after `pip install` | Low | PyPI | +| **B. Standalone executable built with PyInstaller** | Non-technical users | Double-click or run from the command line | Medium | GitHub Releases | +| **C. Docker image** | Server / CI | Works with `docker run` | Low | Docker Hub / GHCR | +| **D. Hybrid (recommended)** | All scenarios | Pick what fits | Medium | PyPI + GitHub Releases | + +--- + +## Option A: Standalone PyPI Release (recommended first step) + +### Core Idea + +Leave the `markitdown` package name unchanged and publish only `markitdown-glmocr` to PyPI. Users install with: + +```bash +pip install markitdown[all] markitdown-glmocr[glmocr] +``` + +Pass `-p` at run time to enable plugins: + +```bash +markitdown -p document.pdf +``` + +### Why not fork a `markitdown-glmocr-all` package? + +1. The `markitdown` plugin mechanism (entry_points) is already in place, so `markitdown-glmocr` stays fully decoupled as a plugin package +2. It avoids maintaining a forked copy of the markitdown core +3. When upstream updates the markitdown core, users upgrade with `pip install -U markitdown` + +### Detailed Steps + +#### 1. 
Update the `markitdown-glmocr` pyproject.toml + +```toml +[project] +name = "markitdown-glmocr" +version = "0.1.0" # static version; the first release does not use dynamic +description = "Intelligent PDF/Image to Markdown converter using GLM-OCR SDK" +readme = "README.md" +requires-python = ">=3.10" +license = "MIT" +authors = [ + { name = "Your Name", email = "your@email.com" }, +] + +# Key: declare a version-range dependency on markitdown +dependencies = [ + "markitdown>=0.1.0,<1.0.0", + "pdfminer.six>=20251230", + "pdfplumber>=0.11.9", + "Pillow>=9.0.0", +] + +[project.optional-dependencies] +glmocr = ["glmocr>=0.1.0"] +all = [ + "glmocr>=0.1.0", + "markitdown[all]", +] +dev = ["pytest>=7.0.0", "build", "twine"] + +# Plugin entry point (already present, no change needed) +[project.entry-points."markitdown.plugin"] +markitdown_glmocr = "markitdown_glmocr" +``` + +#### 2. Write README.md + +Create a complete README under `packages/markitdown-glmocr/`: + +```markdown +# markitdown-glmocr + +Intelligent PDF/Image to Markdown converter plugin for [markitdown](https://github.com/microsoft/markitdown), +powered by [GLM-OCR](https://github.com/zai-org/glm-ocr) SDK. + +## Installation + +pip install markitdown-glmocr[glmocr] + +## Usage + +# Enable plugins with -p flag +markitdown -p document.pdf +markitdown -p image.png + +# Or use programmatically +from markitdown import MarkItDown +md = MarkItDown(enable_plugins=True) +result = md.convert("document.pdf") +print(result.markdown) + +## Configuration + +Set your Zhipu API key: + +export ZHIPU_API_KEY=your_api_key_here +``` + +#### 3. Build and Publish + +```bash +cd packages/markitdown-glmocr + +# Install build tools +pip install build twine + +# Build the wheel and sdist +python -m build + +# Check the package +twine check dist/* + +# Upload to TestPyPI first to verify +twine upload --repository testpypi dist/* + +# Verify the install (dependencies that are not on TestPyPI resolve from the real PyPI) +pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ markitdown-glmocr[glmocr] + +# Publish to PyPI +twine upload dist/* +``` + +#### 4. 
PyPI Account Setup + +- Register an account at https://pypi.org +- Create an API token: Account settings → API tokens → Add API token +- Configure `~/.pypirc`: + +```ini +[pypi] +username = __token__ +password = pypi-xxxxxxxxxxxx + +[testpypi] +username = __token__ +password = pypi-test-xxxxxxxxxxxx +``` + +### Pros and Cons + +| Pros | Cons | +|------|------| +| Standard distribution channel in the Python ecosystem | Users need a Python environment | +| The plugin mechanism keeps it decoupled, so upstream updates do not affect it | The glmocr SDK pulls in many dependencies (numpy, pymupdf, etc.) | +| Clear version management | A PyPI account and token must be maintained | +| One-line `pip install` | | + +--- + +## Option B: Standalone Executable Built with PyInstaller + +### Core Idea + +Bundle markitdown + markitdown-glmocr + glmocr + all dependencies into a single executable so that users do not need to install Python. + +### Detailed Steps + +#### 1. Create the Build Configuration + +Create a `build_standalone/` directory at the project root: + +``` +build_standalone/ +├── build.py # build script +├── markitdown.spec # PyInstaller spec file +└── README.md # usage notes +``` + +#### 2. Write the PyInstaller spec File + +`build_standalone/markitdown.spec`: + +```python +# -*- mode: python ; coding: utf-8 -*- +import sys +from pathlib import Path + +block_cipher = None + +# Collect all implicitly imported modules (entries must be import names, not PyPI distribution names) +hiddenimports = [ + 'markitdown', + 'markitdown.converters', + 'markitdown_glmocr', + 'glmocr', + 'pdfminer', + 'pdfminer.high_level', + 'pdfminer.layout', + 'pdfminer.utils', + 'pdfplumber', + 'PIL', + 'magika', + 'charset_normalizer', + 'markdownify', + 'bs4', # import name of beautifulsoup4 + 'mammoth', + 'openpyxl', + 'pandas', + 'pptx', # import name of python-pptx + 'lxml', + 'numpy', + 'pydantic', + 'pymupdf', + 'fitz', # legacy import name of pymupdf + 'tqdm', + 'yaml', + 'dotenv', + 'requests', + 'defusedxml', +] + +a = Analysis( + ['entry_point.py'], + pathex=[], + binaries=[], + datas=[ + # Include magika's model files + ('magika/models', 'magika/models'), + ], + hiddenimports=hiddenimports, + hookspath=[], + hooksconfig={}, + runtime_hooks=[], + excludes=[], + win_no_prefer_redirects=False, + win_private_assemblies=False, + cipher=block_cipher, + noarchive=False, +) + +pyz = PYZ(a.pure, a.zipped_data, cipher=block_cipher) + +exe = EXE( + pyz, + a.scripts, + a.binaries, + a.zipfiles, + a.datas, + [], + name='markitdown', + debug=False, 
bootloader_ignore_signals=False, + strip=False, + upx=True, + upx_exclude=[], + runtime_tmpdir=None, + console=True, + disable_windowed_traceback=False, + argv_emulation=False, + target_arch=None, + codesign_identity=None, + entitlements_file=None, + icon=None, +) +``` + +#### 3. Write the Entry Point + +`build_standalone/entry_point.py`: + +```python +"""Entry point for PyInstaller build.""" +import sys +import os + +# Make sure plugins are enabled +if '-p' not in sys.argv and '--use-plugins' not in sys.argv: + # Enable the glmocr plugin automatically + sys.argv.insert(1, '-p') + +from markitdown.__main__ import main + +if __name__ == '__main__': + main() +``` + +#### 4. Write the Build Script + +`build_standalone/build.py`: + +```python +#!/usr/bin/env python3 +"""Build standalone markitdown executable with PyInstaller.""" +import subprocess +import sys +import platform +import shutil +from pathlib import Path + +def main(): + project_root = Path(__file__).parent.parent + build_dir = Path(__file__).parent + + # 1. Make sure dependencies are installed + print(">>> Installing dependencies...") + subprocess.run([ + sys.executable, "-m", "pip", "install", "-e", + str(project_root / "packages" / "markitdown[all]"), + ], check=True) + subprocess.run([ + sys.executable, "-m", "pip", "install", "-e", + str(project_root / "packages" / "markitdown-glmocr[glmocr]"), + ], check=True) + subprocess.run([ + sys.executable, "-m", "pip", "install", "pyinstaller", + ], check=True) + + # 2. Run PyInstaller + print(">>> Building executable...") + subprocess.run([ + sys.executable, "-m", "PyInstaller", + "--clean", + "--noconfirm", + str(build_dir / "markitdown.spec"), + ], cwd=str(build_dir), check=True) + + # 3. 
Report the result + dist_dir = build_dir / "dist" + exe_name = "markitdown.exe" if platform.system() == "Windows" else "markitdown" + exe_path = dist_dir / exe_name + + if exe_path.exists(): + size_mb = exe_path.stat().st_size / (1024 * 1024) + print(f"\n✅ Build successful!") + print(f" Executable: {exe_path}") + print(f" Size: {size_mb:.1f} MB") + print(f" Platform: {platform.system()} {platform.machine()}") + else: + print("\n❌ Build failed - executable not found") + sys.exit(1) + +if __name__ == "__main__": + main() +``` + +#### 5. Multi-Platform Builds with GitHub Actions + +`.github/workflows/build-standalone.yml`: + +```yaml +name: Build Standalone Executable + +on: + push: + tags: ['v*'] + workflow_dispatch: + +jobs: + build: + strategy: + matrix: + include: + - os: windows-latest + artifact: markitdown-windows-x64.exe + - os: ubuntu-latest + artifact: markitdown-linux-x64 + - os: macos-latest + artifact: markitdown-macos-x64 + + runs-on: ${{ matrix.os }} + steps: + - uses: actions/checkout@v4 + + - uses: actions/setup-python@v5 + with: + python-version: '3.13' + + - name: Install dependencies + run: | + pip install -e ./packages/markitdown[all] + pip install -e ./packages/markitdown-glmocr[glmocr] + pip install pyinstaller + + - name: Build with PyInstaller + run: | + pyinstaller --clean --noconfirm build_standalone/markitdown.spec + working-directory: . 
+ + - name: Upload artifact + uses: actions/upload-artifact@v4 + with: + name: ${{ matrix.artifact }} + path: dist/markitdown* + + release: + needs: build + runs-on: ubuntu-latest + if: startsWith(github.ref, 'refs/tags/v') + steps: + - uses: actions/download-artifact@v4 + with: + path: artifacts + + - name: Create Release + uses: softprops/action-gh-release@v2 + with: + files: artifacts/** + generate_release_notes: true +``` + +### Estimated Artifact Size + +| Platform | Estimated size | Notes | +|------|---------|------| +| Windows x64 | ~80-120 MB | Includes the Python runtime, numpy, pymupdf, etc. | +| Linux x64 | ~60-90 MB | | +| macOS x64 | ~70-100 MB | | + +### Pros and Cons + +| Pros | Cons | +|------|------| +| No Python environment needed; runs on double-click | Large artifacts (80-120 MB) | +| Friendly to non-technical users | Every update requires a rebuild | +| Works offline | PyInstaller hidden imports are easy to miss and costly to debug | +| Can be distributed through GitHub Releases | Each platform must be built separately | +| | Antivirus software may raise false positives | + +### Alternative: Nuitka + +If PyInstaller runs into problems, consider [Nuitka](https://nuitka.net/): + +```bash +pip install nuitka +python -m nuitka --standalone --onefile \ + --enable-plugin=numpy,pandas \ + --include-data-dir=magika/models=magika/models \ + entry_point.py +``` + +Nuitka compiles to native machine code, which tends to run faster, but builds take longer. + +--- + +## Option C: Docker Image + +### Core Idea + +Extend the official Dockerfile and add the glmocr plugin. + +### Dockerfile + +```dockerfile +FROM python:3.13-slim-bullseye + +ENV DEBIAN_FRONTEND=noninteractive + +RUN apt-get update && apt-get install -y --no-install-recommends \ + ffmpeg exiftool && \ + rm -rf /var/lib/apt/lists/* + +WORKDIR /app +COPY packages/markitdown /app/packages/markitdown +COPY packages/markitdown-glmocr /app/packages/markitdown-glmocr + +RUN pip --no-cache-dir install \ + /app/packages/markitdown[all] \ + /app/packages/markitdown-glmocr[glmocr] + +ENTRYPOINT ["markitdown"] +``` + +### Usage + +```bash +# Build +docker build -t markitdown-glmocr . 
+ +# Run +docker run --rm -v $(pwd):/data markitdown-glmocr -p /data/document.pdf + +# Publish to GHCR +docker tag markitdown-glmocr ghcr.io/yourname/markitdown-glmocr:latest +docker push ghcr.io/yourname/markitdown-glmocr:latest +``` + +### Pros and Cons + +| Pros | Cons | +|------|------| +| Fully isolated environment | Requires Docker | +| Fits CI/CD integration | Image size is ~500 MB or more | +| Friendly to server deployment | Unfriendly to desktop users | + +--- + +## Option D: Hybrid (recommended) + +### Execution Priority + +``` +1️⃣ Option A: publish markitdown-glmocr to PyPI → first choice for Python developers +2️⃣ Option B: PyInstaller build → non-technical users / offline use +3️⃣ Option C: Docker image → server / CI (optional) +``` + +### Execution Plan + +#### Phase 1: PyPI Release (1-2 days) + +1. **Polish the markitdown-glmocr package** + - [ ] Add README.md (installation, usage, configuration) + - [ ] Add a LICENSE file + - [ ] Add a `py.typed` marker (if type-hint support is needed) + - [ ] Fix the version in `__about__.py` to `0.1.0` + - [ ] Make sure all dependency version ranges are reasonable + +2. **Verify locally** + - [ ] Test the install flow in a fresh virtual environment + ```bash + python -m venv /tmp/test-env + source /tmp/test-env/bin/activate + pip install markitdown[all] markitdown-glmocr[glmocr] + markitdown -p --list-plugins # should list markitdown_glmocr + markitdown -p test.pdf # functional test + ``` + +3. **Publish to TestPyPI and verify** + - [ ] `python -m build` + - [ ] `twine upload --repository testpypi dist/*` + - [ ] Install from TestPyPI and test + +4. **Publish to PyPI** + - [ ] `twine upload dist/*` + +5. **Verify after release** + - [ ] `pip install markitdown-glmocr[glmocr]` + - [ ] Functional tests pass + +#### Phase 2: Standalone Executable (2-3 days) + +1. **Set up the PyInstaller build** + - [ ] Create the `build_standalone/` directory and configuration + - [ ] Test a local Windows build + - [ ] Resolve hidden-import problems (the most time-consuming part) + +2. **GitHub Actions CI/CD** + - [ ] Configure the multi-platform build workflow + - [ ] Push a tag to trigger the automated build and Release + +3. **Distribute** + - [ ] Offer downloads on the GitHub Releases page + - [ ] Add download links to the README + +#### Phase 3: Docker Image (optional, 0.5 day) + +1. **Write the Dockerfile** +2. **Publish to GHCR** +3. 
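**Update the docs** + +--- + +## Assessing the Chances of the PR Being Merged + +### How Likely Is Upstream to Accept the PR? + +| Factor | Assessment | +|------|------| +| markitdown already has a plugin mechanism | ✅ Fully compatible architecturally | +| glmocr is a third-party commercial API | ⚠️ Upstream may not want to bind to a specific commercial service | +| Upstream already integrates azure-doc-intel | ✅ There is precedent, but Azure is Microsoft's own product | +| The PR author is not a Microsoft employee | ⚠️ Review may take a long time | +| markitdown is still at 0.x (Beta) | ✅ A good stage for adding features | + +**Conclusion**: upstream will most likely not accept the glmocr plugin PR directly (it binds to a non-Microsoft commercial API), but because the plugin mechanism exists, **upstream acceptance is not required**. Publishing independently to PyPI is a perfectly reasonable path. + +### Suggested Strategy + +1. **Publish independently to PyPI first** (Option A), without depending on upstream +2. **Submit the PR in parallel** as a contribution back to the community; a rejection costs nothing +3. Emphasize in the PR description that: + - The change extends markitdown purely through the plugin mechanism and does not modify core code + - It can serve as a reference implementation for third-party plugin integration + - It comes with complete tests and documentation + +--- + +## Quick Start: Publish to PyPI in 5 Minutes + +To publish right away, run the following commands: + +```bash +# 1. Enter the glmocr plugin directory +cd D:/15-AI-Coding/markitdown/packages/markitdown-glmocr + +# 2. Install build tools +pip install build twine + +# 3. Build +python -m build + +# 4. Check +twine check dist/* + +# 5. Publish to TestPyPI (test first) +twine upload --repository testpypi dist/* + +# 6. Once verified, publish to PyPI +twine upload dist/* +``` + +After the release, other people only need: + +```bash +pip install markitdown-glmocr[glmocr] +export ZHIPU_API_KEY=your_key +markitdown -p your-file.pdf +``` + +--- + +## Appendix: FAQ + +### Q1: What happens if a user installs only markitdown-glmocr without the glmocr SDK? + +Nothing fails at install or import time. glmocr is imported lazily in `_converter.py` and is only checked when a conversion actually runs. +Users should still install `markitdown-glmocr[glmocr]` to get full functionality. + +### Q2: How is compatibility with the markitdown core package handled? + +The `pyproject.toml` of `markitdown-glmocr` declares `markitdown>=0.1.0,<1.0.0`. +The markitdown plugin interface (entry_points) has been stable so far and is not expected to break within the 0.x series. + +### Q3: How is the API key configured after a PyInstaller build? + +Pass it through the `ZHIPU_API_KEY` environment variable, or through a `.env` file at run time: +```bash +# Option 1: environment variable (Windows cmd syntax; use `export ZHIPU_API_KEY=your_key` on Linux/macOS) +set ZHIPU_API_KEY=your_key +markitdown -p document.pdf + +# Option 2: .env file (read automatically by the glmocr SDK) +echo ZHIPU_API_KEY=your_key > .env +markitdown -p document.pdf +``` + +### Q4: Can we ship a "one-click installer" for non-technical users? 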
+ +Yes, by combining PyInstaller with Inno Setup (Windows) or create-dmg (macOS) to build an install wizard: + +``` +Windows: PyInstaller → .exe → Inno Setup → .exe install wizard +macOS: PyInstaller → binary → create-dmg → .dmg +Linux: PyInstaller → binary → AppImage → .AppImage +``` + +This adds maintenance cost, so ship only the bare executable at first and add an install wizard when there is demand. + +### Q5: Are uvx / pipx supported? + +Yes. After the package is on PyPI: + +```bash +# One-off run (no install needed) +uvx --from markitdown-glmocr[glmocr] markitdown -p document.pdf + +# Or install with pipx and inject the plugin into the same environment +pipx install 'markitdown[all]' +pipx inject markitdown 'markitdown-glmocr[glmocr]' +markitdown -p document.pdf +``` + +This is the most recommended route for non-technical users: it is lighter than a PyInstaller build, and the uvx form always runs the latest release. diff --git a/docs/nova-markitdown/SKILL.md b/docs/nova-markitdown/SKILL.md new file mode 100644 index 000000000..c9c53a7dc --- /dev/null +++ b/docs/nova-markitdown/SKILL.md @@ -0,0 +1,173 @@ +--- +name: nova-markitdown +description: + Convert various file formats (PDF, Word, Excel, PPT, images, HTML, audio, video) to Markdown using markitdown CLI with dual OCR fallback: glmocr (primary) → paddleocr (fallback). Activate when users need file-to-markdown conversion, OCR recognition, content extraction, structured data from documents, or batch document processing. Keywords: PDF to markdown, image OCR, document conversion, markitdown, glmocr, paddleocr, file extraction. +compatibility: + Python 3.10+, pip packages: markitdown[all], markitdown-glmocr[glmocr], markitdown-paddleocr. Requires ZHIPU_API_KEY for glmocr, BAIDU_PADDLE_TOKEN for paddleocr fallback. Network access to Zhipu AI API and Baidu PaddleOCR API. 
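+metadata: + author: hankl + version: "2.0.0" +--- + +# nova-markitdown + +Convert files of many formats to Markdown with the markitdown command-line tool, with **automatic fallback between two OCR engines**: glmocr (primary) → paddleocr (fallback). + +## Triggers + +Activate this skill when the user needs to: + +- Convert files (PDF, Word, Excel, PPT, images, HTML, audio, video, etc.) to Markdown text +- Extract text, tables, image descriptions, and similar content from files +- Run OCR and structured extraction on PDFs or images +- Batch-convert multiple files to Markdown + +## Environment Setup + +### Install Dependencies + +```bash +# markitdown with all optional converters (supports most file formats) +pip install 'markitdown[all]' + +# markitdown-glmocr plugin (primary OCR, Zhipu GLM-OCR) +pip install 'markitdown-glmocr[glmocr]' + +# markitdown-paddleocr plugin (fallback OCR, Baidu PaddleOCR) +pip install 'markitdown-paddleocr' +``` + +### Environment Variables + +```bash +# Primary OCR: Zhipu API key (glmocr) +export ZHIPU_API_KEY="your-zhipu-api-key" + +# Fallback OCR: Baidu PaddleOCR token (paddleocr, used when glmocr fails) +export BAIDU_PADDLE_TOKEN="your-paddle-token" + +# Optional settings +export GLMOCR_MODEL="glm-ocr" # glmocr model name +export GLMOCR_TIMEOUT="600" # glmocr request timeout in seconds +export PADDLE_OCR_MODEL="PaddleOCR-VL-1.5" # paddleocr model name +``` + +> **Important**: `ZHIPU_API_KEY` is for glmocr (primary) and `BAIDU_PADDLE_TOKEN` is for paddleocr (fallback). Set both to enable automatic fallback. + +### Verify the Installation + +```bash +markitdown --version +markitdown --list-plugins # output should include markitdown_glmocr and markitdown_paddleocr +``` + +## Core Rules + +1. **Prefer the markitdown command line**: do every file conversion through the `markitdown` CLI where possible. +2. **Use the dual-OCR fallback strategy for PDFs and images**: + - **Step 1**: try parsing with `markitdown -p` (glmocr plugin) + - **Step 2**: if glmocr fails (API error, timeout, invalid key, etc.), switch to the paddleocr plugin and retry + - **Implementation**: wrap the calls in a Python script that catches the exception and switches engines +3. **Do not use `-p` for other file types**: convert Word, Excel, PPT, HTML, audio, and so on with the markitdown command without `-p`. +4. 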
**Fall back to the Python SDK for complex cases**: use Python code when you need structured JSON output, region filtering, or a custom processing pipeline. See [advanced-usage.md](references/advanced-usage.md). + +## Quick Reference + +| File type | Command | `-p` | Notes | +|----------|------|:---:|------| +| PDF | `markitdown -p file.pdf -o out.md` | Yes | glmocr AI OCR | +| Image (.jpg/.png) | `markitdown -p image.png -o out.md` | Yes | glmocr AI OCR | +| Word (.docx) | `markitdown file.docx -o out.md` | No | Built-in converter | +| Excel (.xlsx/.xls) | `markitdown file.xlsx -o out.md` | No | Built-in converter | +| PPT (.pptx) | `markitdown file.pptx -o out.md` | No | Built-in converter | +| HTML | `markitdown file.html -o out.md` | No | Built-in converter | +| CSV/JSON/XML | `markitdown file.csv -o out.md` | No | Built-in converter | +| Audio | `markitdown audio.mp3 -o out.md` | No | Built-in converter | +| ZIP | `markitdown archive.zip -o out.md` | No | Iterates over the archive contents | +| YouTube | `markitdown "https://youtube.com/..." -o out.md` | No | Video transcript | + +## Usage Guide + +### PDF Conversion (dual-OCR fallback) + +```bash +# Option 1: call the CLI directly (glmocr only, no fallback) +markitdown -p document.pdf -o output.md + +# Option 2: dual-OCR fallback in Python (recommended; switches to paddleocr when glmocr fails) +python -c " +from markitdown import MarkItDown +from markitdown_glmocr import GlmOcrConverter +from markitdown_paddleocr import PaddleOcrConverter + +md = MarkItDown(enable_plugins=False) +try: + md.register_converter(GlmOcrConverter(), priority=-1.0) + result = md.convert('document.pdf') + if not result.markdown.strip(): + raise Exception('Empty result') +except Exception as e: + print(f'glmocr failed: {e}, falling back to paddleocr...') + md = MarkItDown(enable_plugins=False) + md.register_converter(PaddleOcrConverter(), priority=-1.0) + result = md.convert('document.pdf') +print(result.markdown) +" +``` + +How it works: plain-text pages are extracted quickly with pdfplumber/pdfminer; complex pages (with images, tables, or formulas) go through AI OCR automatically. When glmocr fails, the Python wrapper falls back to paddleocr. + +### Image Conversion (dual-OCR fallback) + +```bash +# Call the CLI directly (glmocr only) +markitdown -p photo.jpg -o photo.md + +# Dual-OCR fallback in Python (recommended) +python -c " +from markitdown import MarkItDown +from markitdown_glmocr import GlmOcrConverter +from markitdown_paddleocr import 
PaddleOcrConverter + +md = MarkItDown(enable_plugins=False) +try: + md.register_converter(GlmOcrConverter(), priority=-1.0) + result = md.convert('photo.jpg') + if not result.markdown.strip(): + raise Exception('Empty result') +except Exception as e: + print(f'glmocr failed: {e}, falling back to paddleocr...') + md = MarkItDown(enable_plugins=False) + md.register_converter(PaddleOcrConverter(), priority=-1.0) + result = md.convert('photo.jpg') +print(result.markdown) +" +``` + +### Other File Formats + +```bash +markitdown document.docx -o document.md # Word +markitdown spreadsheet.xlsx -o data.md # Excel +markitdown presentation.pptx -o slides.md # PPT +markitdown webpage.html -o webpage.md # HTML +markitdown data.csv -o data.md # CSV +markitdown config.json -o config.md # JSON +markitdown archive.zip -o archive.md # ZIP +``` + +## Troubleshooting + +**Plugin not found**: run `markitdown --list-plugins`. If glmocr is missing, run `pip install 'markitdown-glmocr[glmocr]'`; if paddleocr is missing, run `pip install markitdown-paddleocr`. + +**glmocr API key error**: check `echo $ZHIPU_API_KEY`, or set the key in `.env`. With the Python fallback wrapper, a glmocr failure falls back to paddleocr automatically. + +**paddleocr token error**: check `echo $BAIDU_PADDLE_TOKEN`, or set the token in `.env`. + +**Empty or low-quality PDF output**: make sure `-p` is passed, check the API key/token, and consider setting `GLMOCR_ENABLE_LAYOUT=true` to improve structured output. + +**Both OCR engines fail**: check the network connection and confirm that both the API key and the token are valid. + +## Advanced Usage + +For advanced cases such as structured JSON output, region filtering, batch processing, custom parameters, and a **dual-OCR fallback wrapper**, see [advanced-usage.md](references/advanced-usage.md), which contains complete Python SDK examples and the unified `DualOcrConverter` wrapper. diff --git a/docs/nova-markitdown/references/advanced-usage.md b/docs/nova-markitdown/references/advanced-usage.md new file mode 100644 index 000000000..f21a3699d --- /dev/null +++ b/docs/nova-markitdown/references/advanced-usage.md @@ -0,0 +1,253 @@ +# Advanced Usage: Python SDK + Dual-OCR Fallback + +When the markitdown command line is not enough (for example, structured JSON output, region filtering, custom pipelines, or dual-OCR fallback), use Python code. + +## Scenario 0: DualOcrConverter, Automatic Dual-OCR Fallback (recommended) + +`DualOcrConverter` wraps the automatic glmocr (primary) → paddleocr (fallback) logic and is the recommended way to process PDFs and images. + +```python +from markitdown import MarkItDown +from 
markitdown_glmocr import GlmOcrConverter +from markitdown_paddleocr import PaddleOcrConverter + +class DualOcrConverter: + """Dual-OCR converter: glmocr (primary) → paddleocr (fallback), switching automatically.""" + + def __init__(self, glmocr_kwargs=None, paddleocr_kwargs=None): + self.glmocr_kwargs = glmocr_kwargs or {} + self.paddleocr_kwargs = paddleocr_kwargs or {} + + def convert(self, file_path: str) -> str: + """Convert a file; fall back to paddleocr automatically when glmocr fails.""" + # Step 1: try glmocr + try: + md = MarkItDown(enable_plugins=False) + md.register_converter(GlmOcrConverter(**self.glmocr_kwargs), priority=-1.0) + result = md.convert(file_path) + if result.markdown and result.markdown.strip(): + print("✓ glmocr succeeded") + return result.markdown + raise Exception("glmocr returned empty result") + except Exception as e: + glmocr_error = e # keep the first error so the final message can report it + print(f"⚠ glmocr failed: {e}") + + # Step 2: fall back to paddleocr + try: + md = MarkItDown(enable_plugins=False) + md.register_converter(PaddleOcrConverter(**self.paddleocr_kwargs), priority=-1.0) + result = md.convert(file_path) + if result.markdown and result.markdown.strip(): + print("✓ paddleocr succeeded (fallback)") + return result.markdown + raise Exception("paddleocr returned empty result") + except Exception as e: + print(f"✗ paddleocr also failed: {e}") + raise RuntimeError(f"Both OCR engines failed. glmocr: {glmocr_error}; paddleocr: {e}") from e
+ +# Usage +converter = DualOcrConverter() +markdown = converter.convert("document.pdf") +``` + +### Custom Parameters + +```python +converter = DualOcrConverter( + glmocr_kwargs={ + "api_key": "sk-xxx", + "enable_layout": True, + "force_ai": True, + }, + paddleocr_kwargs={ + "token": "your-paddle-token", + "model": "PaddleOCR-VL-1.5", + "use_chart_recognition": True, + } +) +markdown = converter.convert("complex_report.pdf") +``` + +### Batch Processing + Dual OCR + +```python +from pathlib import Path + +converter = DualOcrConverter() +pdf_dir = Path("./documents") +output_dir = pdf_dir / "output" +output_dir.mkdir(exist_ok=True) + +for pdf_file in pdf_dir.glob("*.pdf"): + try: + markdown = converter.convert(str(pdf_file)) + (output_dir / f"{pdf_file.stem}.md").write_text(markdown, encoding="utf-8") + print(f"✓ {pdf_file.name}") + except RuntimeError: + print(f"✗ {pdf_file.name} — both OCR engines failed") +``` + +## Scenario 1: Structured JSON Output (glmocr region labels and bounding boxes) + +```python +import glmocr + +# Run OCR with a single call +result = glmocr.parse("report.pdf") + +# Get the Markdown text +print(result.markdown_result) + +# Get structured data (grouped by page; each page contains several regions) +for page_idx, page_regions in enumerate(result.json_result): + print(f"Page {page_idx + 1}: {len(page_regions)} regions") + for region in page_regions: + print(f" [{region['label']}] {region['content'][:60]}") + +# Filter specific content types by label (first page only) +tables = [r for r in result.json_result[0] if r["label"] == "table"] +formulas = [r for r in result.json_result[0] if r["label"] == "formula"] +titles = [r for r in result.json_result[0] if r["label"] == "title"] + +# Save to disk (Markdown and JSON together) +result.save(output_dir="./output") +``` + +### Supported Region Labels + +| Label | Meaning | +|------|------| +| `title` | Title | +| `text` | Body text | +| `table` | Table | +| `figure` | Figure | +| `formula` | Formula | +| `header` | Page header | +| `footer` | Page footer | +| `page_number` | Page number | +| `reference` | References | +| `seal` | Seal / stamp | + +## Scenario 2: Using PaddleClient on Its Own (direct paddleocr call) + +```python +from markitdown_paddleocr import 
PaddleClient + +client = PaddleClient(token="your-paddle-token") + +# OCR on a local file +with open("image.png", "rb") as f: + markdown = client.ocr(file_bytes=f.read(), filename="image.png") +print(markdown) + +# OCR in URL mode +markdown = client.ocr(file_url="https://example.com/document.pdf") +print(markdown) +``` + +## Scenario 3: MarkItDown Python API + a Single Converter + +```python +from markitdown import MarkItDown +from markitdown_glmocr import GlmOcrConverter +# or: from markitdown_paddleocr import PaddleOcrConverter + +# glmocr +converter = GlmOcrConverter() +md = MarkItDown(enable_plugins=False) +md.register_converter(converter, priority=-1.0) +result = md.convert("document.pdf") +print(result.text_content) + +# paddleocr +from markitdown_paddleocr import PaddleOcrConverter +converter = PaddleOcrConverter() +md = MarkItDown(enable_plugins=False) +md.register_converter(converter, priority=-1.0) +result = md.convert("document.pdf") +print(result.text_content) +``` + +## Scenario 4: Custom Converter Parameters + +```python +from markitdown import MarkItDown +from markitdown_glmocr import GlmOcrConverter +from markitdown_paddleocr import PaddleOcrConverter + +# Custom glmocr settings +glmocr_converter = GlmOcrConverter( + api_key="sk-xxx", + timeout=600, + enable_layout=True, + force_ai=True, +) + +# Custom paddleocr settings +paddleocr_converter = PaddleOcrConverter( + token="your-token", + model="PaddleOCR-VL-1.5", + poll_interval=3.0, + poll_timeout=600.0, + force_ai=True, + use_chart_recognition=True, +) + +# Wrap both engines with DualOcrConverter (the class defined in Scenario 0) +converter = DualOcrConverter( + glmocr_kwargs={"api_key": "sk-xxx", "enable_layout": True}, + paddleocr_kwargs={"token": "your-token", "use_chart_recognition": True}, +) +markdown = converter.convert("complex_document.pdf") +``` + +## Scenario 5: Images Only (no PDF step) + +```python +import glmocr + +# OCR an image directly with glmocr +result = glmocr.parse("screenshot.png") +print(result.markdown_result) + +# OCR an image directly with paddleocr +from markitdown_paddleocr import PaddleClient +client = PaddleClient(token="your-token") +with 
open("photo.jpg", "rb") as f: + markdown = client.ocr(file_bytes=f.read(), filename="photo.jpg") +print(markdown) +``` + +## Scenario 6: Batch-Processing Multiple Files + +```python +from pathlib import Path + +# Batch processing with DualOcrConverter (recommended; the class is defined in Scenario 0) +converter = DualOcrConverter() + +pdf_dir = Path("./documents") +for pdf_file in pdf_dir.glob("*.pdf"): + try: + markdown = converter.convert(str(pdf_file)) + output_path = pdf_dir / "output" / f"{pdf_file.stem}.md" + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text(markdown, encoding="utf-8") + print(f"✓ {pdf_file.name}") + except RuntimeError: + print(f"✗ {pdf_file.name} — both OCR engines failed") +``` + +## OCR Engine Comparison + +| Dimension | glmocr | paddleocr | +|------|--------|-----------| +| API style | Synchronous SDK call | Asynchronous job polling (submit → poll → fetch) | +| Authentication | `ZHIPU_API_KEY` | `BAIDU_PADDLE_TOKEN` | +| Result format | SDK wrapper object | JSONL stream | +| Structured output | ✅ Region labels + bounding boxes | ❌ Markdown only | +| Table recognition | ✅ HTML → Markdown | ✅ HTML tables | +| Formula recognition | ✅ LaTeX | ✅ LaTeX | +| Seal recognition | ✅ | ✅ | +| Response time | Fast (synchronous) | Slower (needs polling, 2-30s) | +| Best for | First choice; structured output | Fallback when glmocr is unavailable | diff --git a/docs/paddleocr-plugin-design.md b/docs/paddleocr-plugin-design.md new file mode 100644 index 000000000..8adeb8cfa --- /dev/null +++ b/docs/paddleocr-plugin-design.md @@ -0,0 +1,102 @@ +# markitdown-paddleocr Design + +## Overview + +A markitdown OCR plugin built on Baidu's PaddleOCR cloud API, modeled on the markitdown-glmocr architecture. + +## Key Differences from glmocr + +| Dimension | glmocr | paddleocr | +|------|--------|-----------| +| API style | Synchronous SDK call | Asynchronous job polling (submit → poll → fetch result) | +| Authentication | `ZHIPU_API_KEY` | `BAIDU_PADDLE_TOKEN` (bearer token) | +| Result format | SDK wrapper object | JSONL stream (one JSON object per line, containing layoutParsingResults) | +| Image handling | Base64 encoding built into the SDK | The file must be uploaded manually, or a fileUrl passed | +| Model | glm-ocr | PaddleOCR-VL-1.5 | + +## Architecture + +``` +markitdown-paddleocr/ +├── pyproject.toml +├── README.md +└── src/markitdown_paddleocr/ + ├── __init__.py # exports + __plugin_interface_version__ + ├── __about__.py # __version__ + ├── _config.py # PaddleOcrConfig dataclass + ├── 
_paddle_client.py # PaddleOCR API client (submit/poll/fetch) + ├── _converter.py # PaddleOcrConverter(DocumentConverter) + └── _plugin.py # register_converters entry point +``` + +## Core Flow + +``` +File input (PDF/image) + │ + ▼ +PaddleOcrConverter.convert() + │ + ├─ Image file ──► _convert_image() ──► PaddleClient.ocr() ──► markdown + │ + └─ PDF file ──► _convert_pdf() + │ + ├─ Per-page analysis (pdfplumber) + ├─ Plain-text page ──► pdfplumber extraction + └─ Complex page ──► render to image ──► PaddleClient.ocr() ──► markdown +``` + +## PaddleClient Core Logic + +```python +class PaddleClient: + JOB_URL = "https://paddleocr.aistudio-app.com/api/v2/ocr/jobs" + + def ocr(self, file_bytes, filename=None, file_url=None) -> str: + # 1. Submit the job (multipart for local files, JSON for URLs) + job_id = self._submit(file_bytes, filename, file_url) + # 2. Poll the job state (pending → running → done) + result_url = self._poll(job_id) + # 3. Fetch the JSONL result and join the markdown + return self._fetch_markdown(result_url) +``` + +## Key Design Decisions + +1. **Polling interval**: 2s by default, configurable, with a maximum wait of 300s +2. **PDF strategy**: same as glmocr; plain-text pages use pdfplumber, complex pages use OCR +3. **Image upload**: local files are uploaded as multipart/form-data; fileUrl mode is also supported +4. **Result parsing**: markdown is extracted from `layoutParsingResults[].markdown.text` in the JSONL +5. **Environment variables**: `BAIDU_PADDLE_TOKEN` (required), `PADDLE_OCR_MODEL` (defaults to PaddleOCR-VL-1.5) +6. 
**Optional parameters**: `useDocOrientationClassify`, `useDocUnwarping`, `useChartRecognition` + +## Dependencies + +``` +markitdown>=0.1.0 +pdfminer.six>=20251230 +pdfplumber>=0.11.9 +Pillow>=9.0.0 +requests>=2.28.0 +``` + +## Entry Point + +```toml +[project.entry-points."markitdown.plugin"] +markitdown_paddleocr = "markitdown_paddleocr" +``` + +## Usage + +```bash +# Environment variable +export BAIDU_PADDLE_TOKEN="your-token" + +# CLI +markitdown -p document.pdf + +# Python +from markitdown_paddleocr import PaddleOcrConverter +converter = PaddleOcrConverter(token="your-token") +``` diff --git "a/docs/panddle\347\244\272\344\276\213\344\273\243\347\240\201.md" "b/docs/panddle\347\244\272\344\276\213\344\273\243\347\240\201.md" new file mode 100644 index 000000000..b1d68059a --- /dev/null +++ "b/docs/panddle\347\244\272\344\276\213\344\273\243\347\240\201.md" @@ -0,0 +1,122 @@ +# Please make sure the requests library is installed +# pip install requests +import json +import os +import requests +import sys +import time + +JOB_URL = "https://paddleocr.aistudio-app.com/api/v2/ocr/jobs" +TOKEN = os.environ["BAIDU_PADDLE_TOKEN"] # read the token from the environment; never hard-code it +MODEL = "PaddleOCR-VL-1.5" + +file_path = "" + +headers = { + "Authorization": f"bearer {TOKEN}", +} + +optional_payload = { + "useDocOrientationClassify": False, + "useDocUnwarping": False, + "useChartRecognition": False, +} + +print(f"Processing file: {file_path}") + +if file_path.startswith("http"): + # URL Mode + headers["Content-Type"] = "application/json" + payload = { + "fileUrl": file_path, + "model": MODEL, + "optionalPayload": optional_payload + } + job_response = requests.post(JOB_URL, json=payload, headers=headers) +else: + # Local File Mode + if not os.path.exists(file_path): + print(f"Error: File not found at {file_path}") + sys.exit(1) + + data = { + "model": MODEL, + "optionalPayload": json.dumps(optional_payload) + } + + with open(file_path, "rb") as f: + files = {"file": f} + job_response = requests.post(JOB_URL, headers=headers, data=data, files=files) + 
+print(f"Response status: {job_response.status_code}") +if job_response.status_code != 200: + print(f"Response content: {job_response.text}") + +assert job_response.status_code == 200 +jobId = job_response.json()["data"]["jobId"] +print(f"Job submitted successfully. job id: {jobId}") +print("Start polling for results") + +jsonl_url = "" +while True: + job_result_response = requests.get(f"{JOB_URL}/{jobId}", headers=headers) + assert job_result_response.status_code == 200 + state = job_result_response.json()["data"]["state"] + if state == 'pending': + print("The current status of the job is pending") + elif state == 'running': + try: + total_pages = job_result_response.json()['data']['extractProgress']['totalPages'] + extracted_pages = job_result_response.json()['data']['extractProgress']['extractedPages'] + print(f"The current status of the job is running, total pages: {total_pages}, extracted pages: {extracted_pages}") + except KeyError: + print("The current status of the job is running...") + elif state == 'done': + extracted_pages = job_result_response.json()['data']['extractProgress']['extractedPages'] + start_time = job_result_response.json()['data']['extractProgress']['startTime'] + end_time = job_result_response.json()['data']['extractProgress']['endTime'] + print(f"Job completed, successfully extracted pages: {extracted_pages}, start time: {start_time}, end time: {end_time}") + jsonl_url = job_result_response.json()['data']['resultUrl']['jsonUrl'] + break + elif state == "failed": + error_msg = job_result_response.json()['data']['errorMsg'] + print(f"Job failed, failure reason: {error_msg}") + sys.exit(1) # non-zero exit code so callers can detect the failure + + time.sleep(5) + +if jsonl_url: + jsonl_response = requests.get(jsonl_url) + jsonl_response.raise_for_status() + lines = jsonl_response.text.strip().split('\n') + output_dir = "output" + os.makedirs(output_dir, exist_ok=True) + page_num = 0 + for line_num, line in enumerate(lines, start=1): + line = line.strip() + if not line: + continue + result = 
json.loads(line)["result"] + for i, res in enumerate(result["layoutParsingResults"]): + md_filename = os.path.join(output_dir, f"doc_{page_num}.md") + with open(md_filename, "w", encoding="utf-8") as md_file: + md_file.write(res["markdown"]["text"]) + print(f"Markdown document saved at {md_filename}") + for img_path, img in res["markdown"]["images"].items(): + full_img_path = os.path.join(output_dir, img_path) + os.makedirs(os.path.dirname(full_img_path), exist_ok=True) + img_bytes = requests.get(img).content + with open(full_img_path, "wb") as img_file: + img_file.write(img_bytes) + print(f"Image saved to: {full_img_path}") + for img_name, img in res["outputImages"].items(): + img_response = requests.get(img) + if img_response.status_code == 200: + # Save the image locally + filename = os.path.join(output_dir, f"{img_name}_{page_num}.jpg") + with open(filename, "wb") as f: + f.write(img_response.content) + print(f"Image saved to: {filename}") + else: + print(f"Failed to download image, status code: {img_response.status_code}") + page_num += 1 diff --git a/docs/spec.md b/docs/spec.md new file mode 100644 index 000000000..d2d624ddf --- /dev/null +++ b/docs/spec.md @@ -0,0 +1,35 @@ +# sprint0 +# Goal +Refactor the feature that parses PDFs through an AI API: take a screenshot of each page that contains images/tables, then call the AI API to convert it to Markdown + +# Technical requirements +Use the glm-ocr capability through zai-sdk, as follows + +# Key point: the API key (redacted here) must be read from a configuration file + +# Install the latest version +pip install zai-sdk +# Or pin a specific version +pip install zai-sdk==0.2.2 +from zai import ZhipuAiClient + +# Initialize the client +client = ZhipuAiClient(api_key="your-api-key") + +image_url = "https://cdn.bigmodel.cn/static/logo/introduction.png" + +# Call the layout parsing API +response = client.layout_parsing.create( + model="glm-ocr", + file=image_url +) + +# Print the result +print(response) + +Detailed documentation: https://docs.bigmodel.cn/cn/guide/models/vlm/glm-ocr#python + +Design the refactoring plan first + +## sprint1 +Rename: nova-pdf becomes markitdown-glmocr diff --git a/packages/markitdown-glmocr/README.md b/packages/markitdown-glmocr/README.md new file mode 100644 index 000000000..35c221524 --- 
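/dev/null +++ b/packages/markitdown-glmocr/README.md @@ -0,0 +1,283 @@ +# markitdown-glmocr + +An intelligent PDF-to-Markdown plugin that extracts images and tables using the glmocr SDK (Zhipu GLM-OCR). + +## Features + +- 🔍 **Smart detection**: automatically identifies the content type of each page (plain text vs. images/tables) +- 📄 **Default parsing**: plain-text pages are extracted with pdfplumber/pdfminer, which is fast and cheap +- 🤖 **AI enhancement**: complex pages (images, tables) are converted to Markdown with the glmocr SDK +- ⚡ **One-line call**: `glmocr.parse("document.pdf")` performs the OCR, with no manual screenshotting or encoding +- 📊 **Structured output**: returns Markdown plus a JSON structure (with region labels and bounding boxes) + +## Installation + +```bash +# Basic installation +pip install markitdown-glmocr + +# Install the AI features +pip install markitdown-glmocr[glmocr] +``` + +## Configuration + +### Environment variables (recommended) + +```bash +# Required: Zhipu API key +export ZHIPU_API_KEY="your-zhipu-api-key" + +# Optional +export GLMOCR_MODEL="glm-ocr" # Model name +export GLMOCR_TIMEOUT="600" # Request timeout (seconds) +export GLMOCR_ENABLE_LAYOUT="true" # Enable layout detection +export GLMOCR_LOG_LEVEL="INFO" # Log level +``` + +### Configuration priority + +``` +Constructor arguments > environment variables > .env file > config.yaml > built-in defaults +``` + +### Local sensitive configuration + +```bash +# Create a .env file (read automatically) +echo "ZHIPU_API_KEY=your-api-key" > .env +``` + +## Usage + +### Command line (recommended) + +```bash +# 1. Set the API key +export ZHIPU_API_KEY="sk-xxx" + +# 2. List installed plugins +markitdown --list-plugins + +# 3. Convert a PDF with the plugin +markitdown -p document.pdf + +# 4. 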
Save to a file +markitdown -p document.pdf -o output.md +``` + +### Python API + +```python +from markitdown import MarkItDown +from markitdown_glmocr import GlmOcrConverter + +# Option 1: read ZHIPU_API_KEY from the environment automatically +converter = GlmOcrConverter() +md = MarkItDown(enable_plugins=False) +md.register_converter(converter, priority=-1.0) +result = md.convert("document.pdf") +print(result.markdown) + +# Option 2: pass the API key explicitly +converter = GlmOcrConverter(api_key="sk-xxx") +md = MarkItDown(enable_plugins=False) +md.register_converter(converter, priority=-1.0) +result = md.convert("document.pdf") +print(result.markdown) + +# Option 3: use the glmocr SDK directly (simpler) +import glmocr +result = glmocr.parse("document.pdf") +print(result.markdown_result) # Markdown output +print(result.json_result) # Structured JSON (region labels, bounding boxes) +``` + +### Working with results + +```python +import glmocr + +result = glmocr.parse("report.pdf") + +# Get the Markdown +print(result.markdown_result) + +# Get the structured data (grouped by page) +for page_idx, page_regions in enumerate(result.json_result): + print(f"Page {page_idx + 1}: {len(page_regions)} regions") + for region in page_regions: + print(f" [{region['label']}] {region['content'][:60]}") + +# Filter by label +tables = [r for r in result.json_result[0] if r["label"] == "table"] +formulas = [r for r in result.json_result[0] if r["label"] == "formula"] + +# Save to disk +result.save(output_dir="./output") +``` + +## Configuration options + +### GlmOcrConverter parameters + +| Parameter | Type | Default | Description | +|------|------|--------|------| +| `api_key` | str | `ZHIPU_API_KEY` environment variable | Zhipu API key | +| `timeout` | int | 1800 | Request timeout (seconds) | +| `enable_layout` | bool | False | Enable layout detection | +| `force_ai` | bool | False | Force AI for all pages | + +### Environment variables + +| Variable | Description | Example | +|------|------|------| +| `ZHIPU_API_KEY` | API key (required) | `sk-abc123` | +| `GLMOCR_MODEL` | Model name | `glm-ocr` | +| `GLMOCR_TIMEOUT` | Request timeout (seconds) | `600` | +| `GLMOCR_ENABLE_LAYOUT` | Layout detection | `true` | +| `GLMOCR_LOG_LEVEL` | Log level | `INFO` | + +## How it works + +``` +PDF input + │ + ▼ +Analyze content type page by page + │ + ├─ Plain-text page ──► pdfplumber extracts text + │ + └─ Complex page (images/tables) + │ + 
└─► glmocr.parse() one-line call + │ + ├─ Built-in page rendering + ├─ Built-in base64 encoding + └─ Built-in OCR + │ + ▼ +Merge into the complete Markdown output +``` + +## Region labels (json_result) + +The structured data returned by the glmocr SDK supports the following labels: + +| Label | Description | +|------|------| +| `title` | Title | +| `text` | Body text | +| `table` | Table | +| `figure` | Figure | +| `formula` | Formula | +| `header` | Page header | +| `footer` | Page footer | +| `page_number` | Page number | +| `reference` | Reference | +| `seal` | Seal | + +## Architecture + +- **glmocr**: Zhipu OCR SDK; parses PDFs/images in one line of code +- **pdfplumber**: PDF page analysis and plain-text extraction +- **pdfminer**: fallback for plain-text page extraction + +## Dependencies + +- `markitdown>=0.1.0` - base framework +- `pdfplumber>=0.11.9` - PDF parsing and page rendering +- `pdfminer.six>=20251230` - text extraction fallback +- `Pillow>=9.0.0` - image processing +- `glmocr` - Zhipu OCR SDK (optional; required for the AI features) + +## Publishing to PyPI + +### Prerequisites + +1. Install the build tools: + +```bash +pip install build twine hatch +``` + +2. Configure the PyPI API token (as a Windows user environment variable): + +```powershell +# Set a user environment variable in PowerShell +[System.Environment]::SetEnvironmentVariable('PYPI_API_TOKEN', 'pypi-...', 'User') +``` + +Or in Bash/Zsh: + +```bash +export PYPI_API_TOKEN="pypi-..." +``` + +### Quick publish (recommended) + +The project root provides upload scripts that publish both plugins in one step: + +**Bash / Git Bash:** +```bash +# Build both plugins +cd packages/markitdown-glmocr && hatch build + +cd ../markitdown-paddleocr && hatch build + +# Upload (uploads every built version automatically) +cd ../.. +./scripts/pypi-upload.sh + +# Or specify a version number +./scripts/pypi-upload.sh 0.2.0 +``` + +**PowerShell:** +```powershell +# Build both plugins +cd packages/markitdown-glmocr; hatch build +cd ../markitdown-paddleocr; hatch build + +# Upload +cd ../.. +.\scripts\pypi-upload.ps1 + +# Or specify a version number +.\scripts\pypi-upload.ps1 -Version "0.2.0" +``` + +### Manual publish + +```bash +# 1. Enter the package directory +cd packages/markitdown-glmocr + +# 2. Build +hatch build + +# 3. Check +twine check dist/* + +# 4. 
Upload +twine upload --username __token__ --password "$PYPI_API_TOKEN" --disable-progress-bar dist/* +``` + +### Publishing to TestPyPI (testing) + +```bash +twine upload --repository testpypi --username __token__ --password "$PYPI_API_TOKEN" --disable-progress-bar dist/* + +# Install from TestPyPI to verify +pip install --index-url https://test.pypi.org/simple/ markitdown-glmocr +``` + +### Notes + +- Before publishing, make sure the version number in `src/markitdown_glmocr/__about__.py` has been updated +- The same version number cannot be uploaded twice; to fix a release you must bump the version +- Never commit `PYPI_API_TOKEN` to the repository + +## License + +MIT \ No newline at end of file diff --git a/packages/markitdown-glmocr/pyproject.toml b/packages/markitdown-glmocr/pyproject.toml new file mode 100644 index 000000000..ea06823ce --- /dev/null +++ b/packages/markitdown-glmocr/pyproject.toml @@ -0,0 +1,60 @@ +[build-system] +requires = ["hatchling"] +build-backend = "hatchling.build" + +[project] +name = "markitdown-glmocr" +dynamic = ["version"] +description = "Intelligent PDF to Markdown converter using glmocr SDK" +readme = "README.md" +requires-python = ">=3.10" +license = "MIT" +keywords = ["markitdown", "pdf", "ocr", "ai", "llm", "vision", "glm-ocr", "glmocr"] +authors = [ + { name = "Contributors", email = "noreply@github.com" }, +] +classifiers = [ + "Development Status :: 4 - Beta", + "Programming Language :: Python", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", + "Programming Language :: Python :: 3.13", +] + +dependencies = [ + "markitdown>=0.1.0", + "pdfminer.six>=20251230", + "pdfplumber>=0.11.9", + "Pillow>=9.0.0", +] + +[project.optional-dependencies] +glmocr = [ + "glmocr", +] +dev = [ + "pytest>=7.0.0", +] + +[project.urls] +Documentation = "https://github.com/microsoft/markitdown#readme" +Issues = "https://github.com/microsoft/markitdown/issues" +Source = "https://github.com/microsoft/markitdown" + +[tool.hatch.version] +path = "src/markitdown_glmocr/__about__.py" + +# Plugin entry point - MarkItDown will discover this plugin 
+[project.entry-points."markitdown.plugin"] +markitdown_glmocr = "markitdown_glmocr" + +[tool.hatch.build.targets.sdist] +only-include = ["src/markitdown_glmocr"] + +[tool.hatch.build.targets.wheel] +packages = ["src/markitdown_glmocr"] + +[tool.pytest.ini_options] +testpaths = ["tests"] +python_files = ["test_*.py"] \ No newline at end of file diff --git a/packages/markitdown-glmocr/src/markitdown_glmocr/__about__.py b/packages/markitdown-glmocr/src/markitdown_glmocr/__about__.py new file mode 100644 index 000000000..b5fdc7530 --- /dev/null +++ b/packages/markitdown-glmocr/src/markitdown_glmocr/__about__.py @@ -0,0 +1 @@ +__version__ = "0.2.2" diff --git a/packages/markitdown-glmocr/src/markitdown_glmocr/__init__.py b/packages/markitdown-glmocr/src/markitdown_glmocr/__init__.py new file mode 100644 index 000000000..45512966a --- /dev/null +++ b/packages/markitdown-glmocr/src/markitdown_glmocr/__init__.py @@ -0,0 +1,12 @@ +"""markitdown-glmocr: Intelligent PDF to Markdown converter using glmocr SDK.""" + +from ._plugin import register_converters +from ._config import GlmOcrConfig +from ._converter import GlmOcrConverter + +__plugin_interface_version__ = 1 +__all__ = [ + "register_converters", + "GlmOcrConfig", + "GlmOcrConverter", +] \ No newline at end of file diff --git a/packages/markitdown-glmocr/src/markitdown_glmocr/_config.py b/packages/markitdown-glmocr/src/markitdown_glmocr/_config.py new file mode 100644 index 000000000..6f2531fb8 --- /dev/null +++ b/packages/markitdown-glmocr/src/markitdown_glmocr/_config.py @@ -0,0 +1,43 @@ +"""Configuration for markitdown-glmocr.""" + +from dataclasses import dataclass, field +from enum import Enum + + +class ScanDetectionMode(str, Enum): + """扫描检测模式。 + + - PAGE_BY_PAGE: 逐页分析,当前默认行为 + - FIRST_PAGE_HINT: 首页是扫描件则全文档使用OCR + - SAMPLING: 抽样前N页,多数是扫描件则全部OCR + """ + PAGE_BY_PAGE = "page_by_page" + FIRST_PAGE_HINT = "first_page_hint" + SAMPLING = "sampling" + + +@dataclass +class GlmOcrConfig: + """markitdown-glmocr 
configuration. + + Configuration priority (high to low): + 1. Constructor kwargs + 2. Environment variables + 3. .env file + 4. Built-in defaults + """ + + # API configuration + api_key: str = "" # Reads from ZHIPU_API_KEY by default + + # OCR configuration + timeout: int = 1800 + enable_layout: bool = False + + # Processing strategy + force_ai: bool = False + + # Scan detection mode for optimization + scan_detection_mode: ScanDetectionMode = ScanDetectionMode.SAMPLING + scan_sample_pages: int = 3 # Number of pages to sample in SAMPLING mode + scan_text_threshold: int = 50 # Min text length to consider page as non-scanned \ No newline at end of file diff --git a/packages/markitdown-glmocr/src/markitdown_glmocr/_converter.py b/packages/markitdown-glmocr/src/markitdown_glmocr/_converter.py new file mode 100644 index 000000000..35e00900c --- /dev/null +++ b/packages/markitdown-glmocr/src/markitdown_glmocr/_converter.py @@ -0,0 +1,551 @@ +"""GlmOcr PDF/Image Converter - Intelligent PDF and Image to Markdown conversion.""" + +import io +import logging +import sys +from typing import Any, BinaryIO, Optional + +from markitdown import DocumentConverter, DocumentConverterResult, StreamInfo +from markitdown._exceptions import ( + MISSING_DEPENDENCY_MESSAGE, + MissingDependencyException, +) + +from ._config import GlmOcrConfig, ScanDetectionMode + +# Import dependencies +_dependency_exc_info = None +try: + import pdfminer + import pdfminer.high_level + import pdfplumber +except ImportError: + _dependency_exc_info = sys.exc_info() + +# glmocr SDK +try: + import glmocr + from glmocr import GlmOcr +except ImportError: + glmocr = None + GlmOcr = None + + +ACCEPTED_MIME_TYPE_PREFIXES = [ + "application/pdf", + "application/x-pdf", + "image/jpeg", + "image/png", +] + +ACCEPTED_FILE_EXTENSIONS = [".pdf", ".jpg", ".jpeg", ".png"] + + +logger = logging.getLogger(__name__) + + +class GlmOcrConverter(DocumentConverter): + """ + Intelligent PDF/Image converter using glmocr SDK. 
+ + Features: + - Auto-detect page content type (plain text vs images/tables) + - Plain text pages use pdfplumber/pdfminer (fast, free) + - Complex pages use glmocr SDK for AI-powered OCR + - Image files (PNG, JPG) use glmocr SDK directly + - One-liner: glmocr.parse("document.pdf") handles everything + """ + + def __init__( + self, + api_key: Optional[str] = None, + timeout: int = 1800, + enable_layout: bool = False, + force_ai: bool = False, + scan_detection_mode: Optional[ScanDetectionMode] = None, + scan_sample_pages: Optional[int] = None, + scan_text_threshold: Optional[int] = None, + config: Optional[GlmOcrConfig] = None, + ): + """ + Initialize converter. + + Args: + api_key: Zhipu API key (reads from ZHIPU_API_KEY env var if not provided) + timeout: Request timeout in seconds (default: 1800) + enable_layout: Enable layout detection (default: False) + force_ai: Force all pages to use AI (default: False) + scan_detection_mode: 扫描检测模式,优化扫描PDF处理 + scan_sample_pages: SAMPLING模式下抽样页数 (default: 3) + scan_text_threshold: 判定为扫描件的最小文本长度阈值 (default: 50) + config: Optional GlmOcrConfig instance + """ + if glmocr is None: + raise ImportError( + "glmocr is required. 
Install with: pip install markitdown-glmocr[glmocr]" + ) + + # Use config if provided + if config: + self.api_key = api_key or config.api_key + self.timeout = timeout if timeout != 1800 else config.timeout + self.enable_layout = ( + enable_layout if enable_layout else config.enable_layout + ) + self.force_ai = force_ai or config.force_ai + self.scan_detection_mode = ( + scan_detection_mode + if scan_detection_mode is not None + else config.scan_detection_mode + ) + self.scan_sample_pages = ( + scan_sample_pages + if scan_sample_pages is not None + else config.scan_sample_pages + ) + self.scan_text_threshold = ( + scan_text_threshold + if scan_text_threshold is not None + else config.scan_text_threshold + ) + else: + self.api_key = api_key + self.timeout = timeout + self.enable_layout = enable_layout + self.force_ai = force_ai + self.scan_detection_mode = ( + scan_detection_mode + if scan_detection_mode is not None + else ScanDetectionMode.SAMPLING + ) + self.scan_sample_pages = ( + scan_sample_pages if scan_sample_pages is not None else 3 + ) + self.scan_text_threshold = ( + scan_text_threshold if scan_text_threshold is not None else 50 + ) + + # Lazy init GlmOcr instance + self._glmocr: Optional[GlmOcr] = None + + def _get_glmocr(self) -> GlmOcr: + """Get or create GlmOcr instance.""" + if self._glmocr is None: + kwargs = {"timeout": self.timeout, "enable_layout": self.enable_layout} + if self.api_key: + kwargs["api_key"] = self.api_key + self._glmocr = GlmOcr(**kwargs) + return self._glmocr + + def accepts( + self, + file_stream: BinaryIO, + stream_info: StreamInfo, + **kwargs: Any, + ) -> bool: + mimetype = (stream_info.mimetype or "").lower() + extension = (stream_info.extension or "").lower() + + if extension in ACCEPTED_FILE_EXTENSIONS: + return True + + for prefix in ACCEPTED_MIME_TYPE_PREFIXES: + if mimetype.startswith(prefix): + return True + + return False + + def convert( + self, + file_stream: BinaryIO, + stream_info: StreamInfo, + **kwargs: Any, + ) -> 
DocumentConverterResult: + if _dependency_exc_info is not None: + raise MissingDependencyException( + MISSING_DEPENDENCY_MESSAGE.format( + converter=type(self).__name__, + extension=".pdf", + feature="pdf", + ) + ) from _dependency_exc_info[1].with_traceback(_dependency_exc_info[2]) + + extension = (stream_info.extension or "").lower() + + logger.info("GlmOcrConverter: 开始转换, 文件类型=%s", extension) + + # Image files: use glmocr directly + if extension in (".jpg", ".jpeg", ".png"): + return self._convert_image(file_stream, extension) + + # PDF files: use hybrid approach + return self._convert_pdf(file_stream) + + def _convert_image( + self, file_stream: BinaryIO, extension: str = ".png" + ) -> DocumentConverterResult: + """Convert image file using glmocr SDK.""" + img_bytes = file_stream.read() + + logger.info("GlmOcrConverter: 开始 OCR 识别图片, 格式=%s", extension) + try: + result = self._get_glmocr().parse(img_bytes) + except Exception as e: + logger.error( + "GlmOcrConverter: 图片 OCR 识别异常, 格式=%s, 错误=%s", extension, e + ) + raise + + # Check for errors + d = result.to_dict() + if "error" in d: + logger.error( + "GlmOcrConverter: 图片 OCR 返回错误, 格式=%s, 错误=%s", + extension, + d["error"], + ) + raise RuntimeError( + f"GlmOcrConverter: glmocr SDK returned error: {d['error']}" + ) + + markdown = result.markdown_result or "" + logger.info("GlmOcrConverter: 图片 OCR 识别完成, 输出长度=%d", len(markdown)) + return DocumentConverterResult(markdown=markdown) + + def _convert_pdf(self, file_stream: BinaryIO) -> DocumentConverterResult: + pdf_stream = io.BytesIO(file_stream.read()) + pdf_bytes = pdf_stream.getvalue() # Keep original bytes for batch OCR + markdown_parts = [] + + with pdfplumber.open(pdf_stream) as pdf: + total_pages = len(pdf.pages) + logger.info("GlmOcrConverter: 开始处理 PDF, 总页数=%d", total_pages) + + # Optimization: detect if entire PDF is scanned + all_scanned = self._detect_all_scanned(pdf) + + if all_scanned and not self.force_ai: + # Batch mode: upload entire PDF to glmocr SDK 
(single API call) + logger.info( + "GlmOcrConverter: 全文档扫描模式, 批量上传PDF, 页数=%d", + total_pages, + ) + try: + markdown = self._convert_pdf_batch(pdf_bytes) + if markdown.strip(): + logger.info( + "GlmOcrConverter: 批量OCR完成, 输出长度=%d", + len(markdown), + ) + return DocumentConverterResult(markdown=markdown) + except Exception as e: + logger.error( + "GlmOcrConverter: 批量OCR失败, 抛出异常让框架fallback到下一个converter, 错误=%s", + e, + ) + raise + + # Per-page processing (batch mode not used, or batch OCR returned empty output) + for page_num, page in enumerate(pdf.pages): + # Choose processing method + if self.force_ai or all_scanned: + # force_ai, or all scanned but batch OCR returned empty output + logger.info( + "GlmOcrConverter: 第 %d/%d 页, 使用 glmocr OCR", + page_num + 1, + total_pages, + ) + try: + markdown = self._convert_with_glmocr(page, page_num) + except Exception as e: + logger.error( + "GlmOcrConverter: 第 %d 页识别异常, 错误=%s", + page_num + 1, + e, + ) + raise + else: + # Per-page analysis (PAGE_BY_PAGE mode or non-scanned doc) + page_type = self._analyze_page(page) + + if page_type != "plain_text": + logger.info( + "GlmOcrConverter: 第 %d/%d 页, 类型=%s, 使用 glmocr OCR", + page_num + 1, + total_pages, + page_type, + ) + try: + markdown = self._convert_with_glmocr(page, page_num) + except Exception as e: + logger.error( + "GlmOcrConverter: 第 %d 页识别异常, 错误=%s", + page_num + 1, + e, + ) + raise + else: + logger.info( + "GlmOcrConverter: 第 %d/%d 页, 类型=%s, 使用 pdfplumber", + page_num + 1, + total_pages, + page_type, + ) + markdown = self._extract_text_with_tables(page) + + if markdown.strip(): + markdown_parts.append(f"## Page {page_num + 1}\n\n{markdown}") + + page.close() + + markdown = "\n\n".join(markdown_parts).strip() + logger.info("GlmOcrConverter: PDF 转换完成, 输出长度=%d", len(markdown)) + return DocumentConverterResult(markdown=markdown) + + def _convert_pdf_batch(self, pdf_bytes: bytes) -> str: + """Convert entire PDF in a single API call. 
+ + More efficient for scanned PDFs: one API call instead of N calls for N pages. + + Args: + pdf_bytes: Raw PDF file content. + + Returns: + Markdown text from all pages. + """ + logger.info( + "GlmOcrConverter: 批量上传PDF到glmocr SDK, 大小=%d bytes", len(pdf_bytes) + ) + result = self._get_glmocr().parse(pdf_bytes) + + # Check for errors + d = result.to_dict() + if "error" in d: + logger.error( + "GlmOcrConverter: 批量OCR返回错误, 错误=%s", + d["error"], + ) + raise RuntimeError( + f"GlmOcrConverter: glmocr SDK batch OCR error: {d['error']}" + ) + + markdown = result.markdown_result or "" + return markdown + + def _analyze_page(self, page: Any) -> str: + """Analyze page content type.""" + # Check for images + if hasattr(page, "images") and page.images: + return "complex" + + # Check for tables + tables = page.find_tables() + if tables: + return "complex" + + # Check for graphics/curves + if hasattr(page, "curves") and page.curves: + return "complex" + + return "plain_text" + + def _is_scanned_page(self, page: Any) -> bool: + """Check if a page is likely a scanned image. + + A page is considered scanned if: + 1. It contains images, AND + 2. It has very little extractable text (below threshold) + + Args: + page: pdfplumber page object + + Returns: + True if the page appears to be a scanned image + """ + # Must have images to be a scan + has_images = hasattr(page, "images") and bool(page.images) + if not has_images: + return False + + # Check extractable text length + try: + text = page.extract_text() or "" + text_len = len(text.strip()) + # If there's substantial text, it might be a mixed page or + # a digital PDF with embedded images + if text_len >= self.scan_text_threshold: + return False + except Exception: + # If text extraction fails, assume it's a scan + return True + + return True + + def _detect_all_scanned(self, pdf: Any) -> bool: + """Detect if entire PDF is scanned based on scan_detection_mode. 
+ + Optimization: When first few pages are scanned, we can assume + all pages are scanned and skip per-page analysis. + + Args: + pdf: pdfplumber PDF object + + Returns: + True if entire PDF should be treated as scanned + """ + if self.scan_detection_mode == ScanDetectionMode.PAGE_BY_PAGE: + return False + + total_pages = len(pdf.pages) + if total_pages == 0: + return False + + if self.scan_detection_mode == ScanDetectionMode.FIRST_PAGE_HINT: + # Check only first page + first_page = pdf.pages[0] + is_scanned = self._is_scanned_page(first_page) + first_page.close() + if is_scanned: + logger.info( + "GlmOcrConverter: 首页检测为扫描件, 模式=FIRST_PAGE_HINT, 全文档使用OCR" + ) + return is_scanned + + if self.scan_detection_mode == ScanDetectionMode.SAMPLING: + # Sample first N pages + sample_count = min(self.scan_sample_pages, total_pages) + scanned_count = 0 + + for i in range(sample_count): + page = pdf.pages[i] + if self._is_scanned_page(page): + scanned_count += 1 + + # If majority of sampled pages are scanned, treat all as scanned + majority_threshold = sample_count // 2 + 1 + all_scanned = scanned_count >= majority_threshold + + if all_scanned: + logger.info( + "GlmOcrConverter: 抽样检测 %d/%d 页为扫描件, 模式=SAMPLING, 全文档使用OCR", + scanned_count, + sample_count, + ) + + return all_scanned + + return False + + def _convert_with_glmocr(self, page: Any, page_num: int) -> str: + """Convert page using glmocr SDK. + + Raises RuntimeError on OCR failure so the framework can try the next converter. 
+ """ + # Render page to image + img = page.to_image(resolution=150) + img_bytes = io.BytesIO() + img.save(img_bytes, format="PNG") + + logger.info("GlmOcrConverter: glmocr SDK 开始识别第 %d 页", page_num + 1) + try: + result = self._get_glmocr().parse(img_bytes.getvalue()) + except Exception as e: + logger.error( + "GlmOcrConverter: glmocr SDK 第 %d 页识别异常, 错误=%s", page_num + 1, e + ) + raise + + # Check for errors + d = result.to_dict() + if "error" in d: + logger.error( + "GlmOcrConverter: glmocr SDK 第 %d 页返回错误, 错误=%s", + page_num + 1, + d["error"], + ) + raise RuntimeError( + f"GlmOcrConverter: glmocr SDK returned error on page {page_num + 1}: {d['error']}" + ) + + markdown = result.markdown_result or "" + logger.info( + "GlmOcrConverter: glmocr SDK 第 %d 页识别完成, 输出长度=%d", + page_num + 1, + len(markdown), + ) + return markdown + + def _extract_text_with_tables(self, page: Any) -> str: + """Extract text and tables from page.""" + parts = [] + + # Extract text + text = page.extract_text() or "" + if text.strip(): + parts.append(text.strip()) + + # Extract tables + try: + tables = page.extract_tables() + if tables: + for table in tables: + if table: + md_table = self._table_to_markdown(table) + if md_table.strip(): + parts.append(md_table) + except Exception: + pass + + return "\n\n".join(parts) + + def _table_to_markdown(self, table: list[list[str]]) -> str: + """Convert table to Markdown.""" + if not table: + return "" + + # Filter None values + table = [[cell if cell is not None else "" for cell in row] for row in table] + + # Filter empty rows + table = [row for row in table if any(cell.strip() for cell in row)] + + if not table: + return "" + + # Calculate column widths + col_widths = [ + max(len(str(row[i])) if i < len(row) else 0 for row in table) + for i in range(max(len(row) for row in table)) + ] + + # Format table + lines = [] + for row_idx, row in enumerate(table): + padded_row = row + [""] * (len(col_widths) - len(row)) + line = ( + "| " + + " | ".join( + 
str(cell).ljust(width) + for cell, width in zip(padded_row, col_widths) + ) + + " |" + ) + lines.append(line) + + if row_idx == 0: + sep = "|" + "|".join("-" * (w + 2) for w in col_widths) + "|" + lines.append(sep) + + return "\n".join(lines) + + def close(self): + """Close the GlmOcr instance.""" + if self._glmocr: + self._glmocr.close() + self._glmocr = None + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc_val, exc_tb): + self.close() diff --git a/packages/markitdown-glmocr/src/markitdown_glmocr/_plugin.py b/packages/markitdown-glmocr/src/markitdown_glmocr/_plugin.py new file mode 100644 index 000000000..025a5ffd7 --- /dev/null +++ b/packages/markitdown-glmocr/src/markitdown_glmocr/_plugin.py @@ -0,0 +1,46 @@ +"""Plugin registration for markitdown-glmocr.""" + +import logging +from typing import Any + +from markitdown import MarkItDown + +from ._converter import GlmOcrConverter + +__plugin_interface_version__ = 1 + +logger = logging.getLogger(__name__) + + +def register_converters(markitdown: MarkItDown, **kwargs: Any) -> None: + """ + Register markitdown-glmocr converter. + + Config sources (priority high to low): + 1. kwargs parameters + 2. Environment variables (ZHIPU_API_KEY) + 3. .env file + 4. Built-in defaults + """ + logger.info("markitdown-glmocr: 开始注册插件") + + # Register converter + # Priority -1.0: same level as PaddleOcrConverter, + # the upper-level agent's skills control which plugin to call first. 
+ PRIORITY_GLMOCR = -1.0 + + try: + converter = GlmOcrConverter( + api_key=kwargs.get("api_key"), + timeout=kwargs.get("timeout", 1800), + enable_layout=kwargs.get("enable_layout", False), + force_ai=kwargs.get("force_ai", False), + ) + markitdown.register_converter( + converter, + priority=PRIORITY_GLMOCR, + ) + logger.info("markitdown-glmocr: 插件注册成功, priority=%.1f", PRIORITY_GLMOCR) + except Exception as e: + logger.error("markitdown-glmocr: 插件注册失败, 错误=%s", e) + raise diff --git a/packages/markitdown-glmocr/tests/__init__.py b/packages/markitdown-glmocr/tests/__init__.py new file mode 100644 index 000000000..dfa7b4968 --- /dev/null +++ b/packages/markitdown-glmocr/tests/__init__.py @@ -0,0 +1 @@ +"""Tests for markitdown-glmocr converter.""" \ No newline at end of file diff --git a/packages/markitdown-glmocr/tests/test_converter.py b/packages/markitdown-glmocr/tests/test_converter.py new file mode 100644 index 000000000..d91c7d995 --- /dev/null +++ b/packages/markitdown-glmocr/tests/test_converter.py @@ -0,0 +1,125 @@ +"""Tests for markitdown-glmocr converter.""" + +import io +import pytest +from unittest.mock import MagicMock, patch, PropertyMock + +from markitdown_glmocr._converter import GlmOcrConverter +from markitdown_glmocr._config import ScanDetectionMode + + +class TestGlmOcrConverter: + """Converter tests.""" + + @patch("markitdown_glmocr._converter.glmocr") + def test_accepts_pdf_extension(self, mock_glmocr): + """Accept .pdf extension.""" + converter = GlmOcrConverter() + stream = io.BytesIO(b"%PDF-1.4") + stream_info = MagicMock(extension=".pdf", mimetype=None) + + assert converter.accepts(stream, stream_info) is True + + @patch("markitdown_glmocr._converter.glmocr") + def test_accepts_pdf_mimetype(self, mock_glmocr): + """Accept PDF MIME type.""" + converter = GlmOcrConverter() + stream = io.BytesIO(b"%PDF-1.4") + stream_info = MagicMock(extension=None, mimetype="application/pdf") + + assert converter.accepts(stream, stream_info) is True + + 
@patch("markitdown_glmocr._converter.glmocr") + def test_rejects_non_pdf(self, mock_glmocr): + """Reject non-PDF files.""" + converter = GlmOcrConverter() + stream = io.BytesIO(b"not a pdf") + stream_info = MagicMock(extension=".txt", mimetype="text/plain") + + assert converter.accepts(stream, stream_info) is False + + @patch("markitdown_glmocr._converter.glmocr") + def test_table_to_markdown(self, mock_glmocr): + """Table to Markdown conversion.""" + converter = GlmOcrConverter() + table = [ + ["Name", "Age", "City"], + ["Alice", "25", "Beijing"], + ["Bob", "30", "Shanghai"], + ] + + result = converter._table_to_markdown(table) + + assert "|" in result + assert "Name" in result + assert "Alice" in result + assert "---" in result # Separator + + @patch("markitdown_glmocr._converter.glmocr") + def test_plain_text_page_without_ai(self, mock_glmocr): + """Plain text page without AI.""" + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.PAGE_BY_PAGE, + ) + + # Mock page + page = MagicMock() + page.images = [] + page.find_tables.return_value = [] + page.curves = [] + page.extract_text.return_value = "Hello World" + page.extract_tables.return_value = [] + page.close = MagicMock() + + # Mock PDF + mock_pdf = MagicMock() + mock_pdf.pages = [page] + + with patch("markitdown_glmocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = mock_pdf + + stream = io.BytesIO(b"%PDF-1.4") + result = converter.convert(stream, MagicMock()) + + assert "Hello World" in result.markdown + + @patch("markitdown_glmocr._converter.glmocr") + def test_force_ai_mode(self, mock_glmocr): + """Force AI mode.""" + # Mock glmocr instance + mock_result = MagicMock() + mock_result.markdown_result = "AI result" + mock_result.to_dict.return_value = {} + + mock_glmocr_instance = MagicMock() + mock_glmocr_instance.parse.return_value = mock_result + mock_glmocr.GlmOcr.return_value = mock_glmocr_instance + + converter = GlmOcrConverter(force_ai=True) 
+ # Force initialization of the mocked glmocr + converter._get_glmocr = lambda: mock_glmocr_instance + + # Even plain text page + page = MagicMock() + page.images = [] + page.find_tables.return_value = [] + page.curves = [] + page.extract_text.return_value = "Plain text" + page.extract_tables.return_value = [] + page.close = MagicMock() + + # Mock to_image + mock_img = MagicMock() + page.to_image.return_value = mock_img + + mock_pdf = MagicMock() + mock_pdf.pages = [page] + + with patch("markitdown_glmocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = mock_pdf + + stream = io.BytesIO(b"%PDF-1.4") + result = converter.convert(stream, MagicMock()) + + # Should call AI (because force_ai=True) + mock_glmocr_instance.parse.assert_called_once() diff --git a/packages/markitdown-glmocr/tests/test_scan_detection.py b/packages/markitdown-glmocr/tests/test_scan_detection.py new file mode 100644 index 000000000..01b2442a6 --- /dev/null +++ b/packages/markitdown-glmocr/tests/test_scan_detection.py @@ -0,0 +1,437 @@ +"""Tests for scan detection optimization in GlmOcrConverter.""" + +import pytest +from unittest.mock import MagicMock, patch + +from markitdown_glmocr._config import GlmOcrConfig, ScanDetectionMode +from markitdown_glmocr._converter import GlmOcrConverter + + +class TestScanDetectionMode: + """扫描检测模式配置测试""" + + def test_default_mode_is_sampling(self): + """默认模式应为 SAMPLING""" + config = GlmOcrConfig() + assert config.scan_detection_mode == ScanDetectionMode.SAMPLING + + def test_custom_mode_from_config(self): + """从配置对象读取自定义模式""" + with patch("markitdown_glmocr._converter.glmocr"): + config = GlmOcrConfig(scan_detection_mode=ScanDetectionMode.FIRST_PAGE_HINT) + converter = GlmOcrConverter(config=config) + assert converter.scan_detection_mode == ScanDetectionMode.FIRST_PAGE_HINT + + def test_custom_mode_from_constructor(self): + """从构造函数传入自定义模式""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = 
GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.PAGE_BY_PAGE, + ) + assert converter.scan_detection_mode == ScanDetectionMode.PAGE_BY_PAGE + + def test_constructor_overrides_config(self): + """构造函数参数优先于配置对象""" + with patch("markitdown_glmocr._converter.glmocr"): + config = GlmOcrConfig(scan_detection_mode=ScanDetectionMode.FIRST_PAGE_HINT) + converter = GlmOcrConverter( + config=config, + scan_detection_mode=ScanDetectionMode.PAGE_BY_PAGE, + ) + assert converter.scan_detection_mode == ScanDetectionMode.PAGE_BY_PAGE + + +class TestIsScannedPage: + """扫描页面检测测试""" + + def test_page_without_images_not_scanned(self): + """无图片的页面不是扫描件""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter() + + page = MagicMock() + page.images = [] + page.extract_text.return_value = "Some text content here" + + assert converter._is_scanned_page(page) is False + + def test_page_with_images_and_text_not_scanned(self): + """有图片但有足够文本的页面不是扫描件""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter(scan_text_threshold=50) + + page = MagicMock() + page.images = [MagicMock()] + page.extract_text.return_value = "This is more than 50 characters of text content that should be extracted" + + assert converter._is_scanned_page(page) is False + + def test_page_with_images_no_text_is_scanned(self): + """有图片但无文本的页面是扫描件""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter(scan_text_threshold=50) + + page = MagicMock() + page.images = [MagicMock()] + page.extract_text.return_value = "" + + assert converter._is_scanned_page(page) is True + + def test_page_with_images_little_text_is_scanned(self): + """有图片但文本少于阈值的页面是扫描件""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter(scan_text_threshold=50) + + page = MagicMock() + page.images = [MagicMock()] + page.extract_text.return_value = "Short text" # Only 10 chars + + assert converter._is_scanned_page(page) is True + + def 
test_text_extraction_error_assumes_scanned(self): + """Assume the page is scanned when text extraction fails.""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter() + + page = MagicMock() + page.images = [MagicMock()] + page.extract_text.side_effect = Exception("Extraction failed") + + assert converter._is_scanned_page(page) is True + + def test_custom_threshold(self): + """A custom threshold takes effect.""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter(scan_text_threshold=100) + + # Text below threshold + page1 = MagicMock() + page1.images = [MagicMock()] + page1.extract_text.return_value = "Short text below the threshold" # 30 chars + + assert converter._is_scanned_page(page1) is True + + # Text above threshold + page2 = MagicMock() + page2.images = [MagicMock()] + page2.extract_text.return_value = "This is definitely more than 100 characters of text content here for testing and verification purposes" # 102 chars + + assert converter._is_scanned_page(page2) is False + + +class TestDetectAllScanned: + """Tests for whole-document scan detection.""" + + def test_page_by_page_mode_returns_false(self): + """PAGE_BY_PAGE mode always returns False.""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.PAGE_BY_PAGE, + ) + + # Even with all scanned pages + pdf = MagicMock() + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + scanned_page.close = MagicMock() + pdf.pages = [scanned_page, scanned_page, scanned_page] + + assert converter._detect_all_scanned(pdf) is False + + def test_first_page_hint_first_page_scanned(self): + """FIRST_PAGE_HINT mode: a scanned first page marks the whole document as scanned.""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.FIRST_PAGE_HINT, + ) + + # First page scanned + pdf = MagicMock() + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + scanned_page.close = 
MagicMock() + + normal_page = MagicMock() + normal_page.images = [] + normal_page.extract_text.return_value = "Normal text" + + pdf.pages = [scanned_page, normal_page, normal_page] + + assert converter._detect_all_scanned(pdf) is True + + def test_first_page_hint_first_page_not_scanned(self): + """FIRST_PAGE_HINT 模式,首页非扫描则不判定全扫描""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.FIRST_PAGE_HINT, + ) + + # First page not scanned + pdf = MagicMock() + normal_page = MagicMock() + normal_page.images = [] + normal_page.extract_text.return_value = "Normal text" + + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + pdf.pages = [normal_page, scanned_page, scanned_page] + + assert converter._detect_all_scanned(pdf) is False + + def test_sampling_mode_majority_scanned(self): + """SAMPLING 模式,多数页面扫描则全文档扫描""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + # 3 pages, 2 scanned, 1 normal -> majority scanned + pdf = MagicMock() + + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + normal_page = MagicMock() + normal_page.images = [] + normal_page.extract_text.return_value = "Normal text" + + pdf.pages = [scanned_page, scanned_page, normal_page] + + assert converter._detect_all_scanned(pdf) is True + + def test_sampling_mode_minority_scanned(self): + """SAMPLING 模式,少数页面扫描则不判定全扫描""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + # 3 pages, 1 scanned, 2 normal -> minority scanned + pdf = MagicMock() + + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + normal_page = MagicMock() + 
normal_page.images = [] + normal_page.extract_text.return_value = "Normal text" + + pdf.pages = [normal_page, normal_page, scanned_page] + + assert converter._detect_all_scanned(pdf) is False + + def test_sampling_mode_all_scanned(self): + """SAMPLING 模式,所有抽样页扫描则全文档扫描""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + pdf = MagicMock() + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + pdf.pages = [scanned_page, scanned_page, scanned_page, scanned_page] + + assert converter._detect_all_scanned(pdf) is True + + def test_sampling_mode_custom_sample_count(self): + """SAMPLING 模式,自定义抽样页数""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=5, + ) + + # 5 pages sampled, 3 scanned -> majority + pdf = MagicMock() + + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + normal_page = MagicMock() + normal_page.images = [] + normal_page.extract_text.return_value = "Normal text" + + pdf.pages = [scanned_page, scanned_page, scanned_page, normal_page, normal_page] + + assert converter._detect_all_scanned(pdf) is True + + def test_empty_pdf_returns_false(self): + """空 PDF 返回 False""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter() + + pdf = MagicMock() + pdf.pages = [] + + assert converter._detect_all_scanned(pdf) is False + + def test_pdf_with_less_pages_than_sample_count(self): + """PDF 页数少于抽样数时使用实际页数""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=5, + ) + + # Only 2 pages, both scanned -> majority + pdf = MagicMock() + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + 
scanned_page.extract_text.return_value = "" + + pdf.pages = [scanned_page, scanned_page] + + assert converter._detect_all_scanned(pdf) is True + + +class TestConvertPdfWithScanDetection: + """PDF 转换中的扫描检测集成测试""" + + def test_all_scanned_uses_batch_mode(self): + """全扫描模式优先使用批量上传""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + # Mock _detect_all_scanned to return True + converter._detect_all_scanned = MagicMock(return_value=True) + converter._convert_pdf_batch = MagicMock(return_value="Batch OCR result") + converter._convert_with_glmocr = MagicMock(return_value="Page OCR result") + + # Mock PDF + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + scanned_page.close = MagicMock() + + pdf = MagicMock() + pdf.pages = [scanned_page, scanned_page] + + with patch("markitdown_glmocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = pdf + + import io + stream = io.BytesIO(b"%PDF-1.4") + result = converter._convert_pdf(stream) + + # Should call batch mode (1 API call) + converter._convert_pdf_batch.assert_called_once() + # Should NOT call per-page OCR + converter._convert_with_glmocr.assert_not_called() + assert "Batch OCR result" in result.markdown + + def test_batch_failure_fallback_to_per_page(self): + """批量OCR失败后降级为逐页处理""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + # Mock _detect_all_scanned to return True + converter._detect_all_scanned = MagicMock(return_value=True) + converter._convert_pdf_batch = MagicMock(side_effect=RuntimeError("Batch API error")) + converter._convert_with_glmocr = MagicMock(return_value="Page OCR result") + + # Mock PDF + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + 
scanned_page.extract_text.return_value = "" + scanned_page.close = MagicMock() + + pdf = MagicMock() + pdf.pages = [scanned_page, scanned_page] + + with patch("markitdown_glmocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = pdf + + import io + stream = io.BytesIO(b"%PDF-1.4") + result = converter._convert_pdf(stream) + + # Should have tried batch first + converter._convert_pdf_batch.assert_called_once() + # Should fall back to per-page OCR + assert converter._convert_with_glmocr.call_count == 2 + + def test_all_scanned_skips_per_page_analysis(self): + """全扫描模式跳过逐页分析""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + # Mock _detect_all_scanned to return True + converter._detect_all_scanned = MagicMock(return_value=True) + converter._convert_pdf_batch = MagicMock(return_value="Batch OCR result") + converter._analyze_page = MagicMock(return_value="plain_text") + + # Mock PDF + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + scanned_page.close = MagicMock() + + pdf = MagicMock() + pdf.pages = [scanned_page, scanned_page] + + with patch("markitdown_glmocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = pdf + + import io + stream = io.BytesIO(b"%PDF-1.4") + result = converter._convert_pdf(stream) + + # Should call batch mode, not _analyze_page + converter._convert_pdf_batch.assert_called_once() + converter._analyze_page.assert_not_called() + + def test_page_by_page_mode_analyzes_each_page(self): + """PAGE_BY_PAGE 模式分析每页""" + with patch("markitdown_glmocr._converter.glmocr"): + converter = GlmOcrConverter( + scan_detection_mode=ScanDetectionMode.PAGE_BY_PAGE, + ) + + # Mock _analyze_page to return different results + converter._analyze_page = MagicMock(side_effect=["plain_text", "complex"]) + 
converter._convert_with_glmocr = MagicMock(return_value="OCR result") + converter._extract_text_with_tables = MagicMock(return_value="Text result") + + # Mock PDF + page1 = MagicMock() + page1.close = MagicMock() + page2 = MagicMock() + page2.close = MagicMock() + + pdf = MagicMock() + pdf.pages = [page1, page2] + + with patch("markitdown_glmocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = pdf + + import io + stream = io.BytesIO(b"%PDF-1.4") + result = converter._convert_pdf(stream) + + # Should analyze each page + assert converter._analyze_page.call_count == 2 + # Should use different methods for different pages + converter._extract_text_with_tables.assert_called_once() + converter._convert_with_glmocr.assert_called_once() \ No newline at end of file diff --git a/packages/markitdown-paddleocr/README.md b/packages/markitdown-paddleocr/README.md new file mode 100644 index 000000000..e64f7120a --- /dev/null +++ b/packages/markitdown-paddleocr/README.md @@ -0,0 +1,244 @@ +# markitdown-paddleocr + +A PDF/image-to-Markdown plugin for MarkItDown that performs OCR through Baidu's PaddleOCR cloud API. + +## Features + +- 🔍 **Content detection**: classifies each page automatically (plain text vs. images/tables) +- 📄 **Default parsing**: plain-text pages are extracted with pdfplumber/pdfminer, which is fast and low-cost +- 🤖 **AI enhancement**: complex pages (images, tables) are converted to Markdown through the PaddleOCR API +- 🔄 **Asynchronous job model**: submit an OCR job → poll its status → fetch the result +- 📊 **Structured output**: returns Markdown, including tables, formulas, and charts + +## Installation + +```bash +pip install markitdown-paddleocr +``` + +## Configuration + +### Environment variables (recommended) + +```bash +# Required: Baidu PaddleOCR token +export BAIDU_PADDLE_TOKEN="your-paddle-token" + +# Optional +export PADDLE_OCR_MODEL="PaddleOCR-VL-1.6" # model name +``` + +### Configuration priority + +``` +constructor arguments > environment variables > built-in defaults +``` + +## Usage + +### Command line (recommended) + +```bash +# 1. Set the token +export BAIDU_PADDLE_TOKEN="your-token" + +# 2. List installed plugins +markitdown --list-plugins + +# 3. Convert a PDF with plugins enabled +markitdown -p document.pdf + +# 4. 
Save to a file +markitdown -p document.pdf -o output.md +``` + +### Python API + +```python +from markitdown import MarkItDown +from markitdown_paddleocr import PaddleOcrConverter + +# Option 1: read BAIDU_PADDLE_TOKEN from the environment automatically +converter = PaddleOcrConverter() +md = MarkItDown(enable_plugins=False) +md.register_converter(converter, priority=-1.0) +result = md.convert("document.pdf") +print(result.markdown) + +# Option 2: pass the token explicitly +converter = PaddleOcrConverter(token="your-token") +md = MarkItDown(enable_plugins=False) +md.register_converter(converter, priority=-1.0) +result = md.convert("document.pdf") +print(result.markdown) + +# Option 3: force OCR on every page +converter = PaddleOcrConverter(token="your-token", force_ai=True) +md = MarkItDown(enable_plugins=False) +md.register_converter(converter, priority=-1.0) +result = md.convert("document.pdf") +print(result.markdown) +``` + +### Using PaddleClient directly + +```python +from markitdown_paddleocr import PaddleClient + +client = PaddleClient(token="your-token") + +# Local file +markdown = client.ocr(file_bytes=open("image.png", "rb").read(), filename="image.png") +print(markdown) + +# URL mode +markdown = client.ocr(file_url="https://example.com/document.pdf") +print(markdown) +``` + +## Configuration options + +### PaddleOcrConverter parameters + +| Parameter | Type | Default | Description | +|------|------|--------|------| +| `token` | str | `BAIDU_PADDLE_TOKEN` environment variable | PaddleOCR token | +| `model` | str | `PaddleOCR-VL-1.6` | OCR model name | +| `poll_interval` | float | 2.0 | Polling interval (seconds) | +| `poll_timeout` | float | 300.0 | Polling timeout (seconds) | +| `force_ai` | bool | False | Force OCR on every page | +| `use_doc_orientation_classify` | bool | False | Document orientation classification | +| `use_doc_unwarping` | bool | False | Document unwarping | +| `use_chart_recognition` | bool | False | Chart recognition | + +### Environment variables + +| Variable | Description | Example | +|------|------|------| +| `BAIDU_PADDLE_TOKEN` | Token (required) | `7963b85a...` | +| `PADDLE_OCR_MODEL` | Model name | `PaddleOCR-VL-1.6` | + +## How it works + +``` +PDF/image input + │ + ▼ +PaddleOcrConverter.convert() + │ + ├─ Image file ──► PaddleClient.ocr() ──► markdown + │ + 
└─ PDF file ──► analyze each page's content type + │ + ├─ Plain-text page ──► extract text with pdfplumber + │ + └─ Complex page (images/tables) + │ + └─► render as image ──► PaddleClient.ocr() + │ + ├─ POST /api/v2/ocr/jobs (submit job) + ├─ GET /api/v2/ocr/jobs/{id} (poll status) + └─ GET jsonUrl (fetch JSONL result) + │ + ▼ +Merged into a single Markdown document +``` + +## Dependencies + +- `markitdown>=0.1.0` - base framework +- `pdfplumber>=0.11.9` - PDF parsing and page rendering +- `pdfminer.six>=20251230` - fallback text extraction +- `Pillow>=9.0.0` - image processing +- `requests>=2.28.0` - HTTP requests + +## Publishing to PyPI + +### Prerequisites + +1. Install the build tools: + +```bash +pip install build twine hatch +``` + +2. Configure the PyPI API token (as a Windows user environment variable): + +```powershell +# Set a user environment variable from PowerShell +[System.Environment]::SetEnvironmentVariable('PYPI_API_TOKEN', 'pypi-...', 'User') +``` + +Or in Bash/Zsh: + +```bash +export PYPI_API_TOKEN="pypi-..." +``` + +### Quick publish (recommended) + +The repository root provides upload scripts that publish both plugins in one step: + +**Bash / Git Bash:** +```bash +# Build both plugins +cd packages/markitdown-glmocr && hatch build + +cd ../markitdown-paddleocr && hatch build + +# Upload (uploads every built version automatically) +cd ../.. +./scripts/pypi-upload.sh + +# Or specify a version +./scripts/pypi-upload.sh 0.2.0 +``` + +**PowerShell:** +```powershell +# Build both plugins +cd packages/markitdown-glmocr; hatch build +cd ../markitdown-paddleocr; hatch build + +# Upload +cd ../.. +.\scripts\pypi-upload.ps1 + +# Or specify a version +.\scripts\pypi-upload.ps1 -Version "0.2.0" +``` + +### Manual publish + +```bash +# 1. Enter the package directory +cd packages/markitdown-paddleocr + +# 2. Build +hatch build + +# 3. Check +twine check dist/* + +# 4. 
Upload +twine upload --username __token__ --password "$PYPI_API_TOKEN" --disable-progress-bar dist/* +``` + +### Publishing to TestPyPI (testing) + +```bash +twine upload --repository testpypi --username __token__ --password "$PYPI_API_TOKEN" --disable-progress-bar dist/* + +# Install from TestPyPI to verify (dependencies are resolved from PyPI) +pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ markitdown-paddleocr +``` + +### Notes + +- Before publishing, make sure the version in `src/markitdown_paddleocr/__about__.py` has been updated +- A version number cannot be uploaded twice; to fix a release you must bump the version +- Never commit `PYPI_API_TOKEN` to the repository; TestPyPI uses a separate account and API token from PyPI, so `PYPI_API_TOKEN` must hold a TestPyPI token for the TestPyPI upload above + +## License + +MIT diff --git a/packages/markitdown-paddleocr/pyproject.toml b/packages/markitdown-paddleocr/pyproject.toml new file mode 100644 index 000000000..f3326cd04 --- /dev/null +++ b/packages/markitdown-paddleocr/pyproject.toml @@ -0,0 +1,58 @@ +[build-system] +requires = ["hatchling"] +build-backend = "hatchling.build" + +[project] +name = "markitdown-paddleocr" +dynamic = ["version"] +description = "Intelligent PDF/Image to Markdown converter using PaddleOCR cloud API" +readme = "README.md" +requires-python = ">=3.10" +license = "MIT" +keywords = ["markitdown", "pdf", "ocr", "paddleocr", "baidu", "vision"] +authors = [ + { name = "Contributors", email = "noreply@github.com" }, +] +classifiers = [ + "Development Status :: 4 - Beta", + "Programming Language :: Python", + "Programming Language :: Python :: 3.10", + "Programming Language :: Python :: 3.11", + "Programming Language :: Python :: 3.12", + "Programming Language :: Python :: 3.13", +] + +dependencies = [ + "markitdown>=0.1.0", + "pdfminer.six>=20251230", + "pdfplumber>=0.11.9", + "Pillow>=9.0.0", + "requests>=2.28.0", +] + +[project.optional-dependencies] +dev = [ + "pytest>=7.0.0", +] + +[project.urls] +Documentation = "https://github.com/microsoft/markitdown#readme" +Issues = "https://github.com/microsoft/markitdown/issues" +Source = "https://github.com/microsoft/markitdown" + +[tool.hatch.version] +path = "src/markitdown_paddleocr/__about__.py" + +# Plugin entry point - MarkItDown will discover this plugin 
+[project.entry-points."markitdown.plugin"] +markitdown_paddleocr = "markitdown_paddleocr" + +[tool.hatch.build.targets.sdist] +only-include = ["src/markitdown_paddleocr"] + +[tool.hatch.build.targets.wheel] +packages = ["src/markitdown_paddleocr"] + +[tool.pytest.ini_options] +testpaths = ["tests"] +python_files = ["test_*.py"] diff --git a/packages/markitdown-paddleocr/src/markitdown_paddleocr/__about__.py b/packages/markitdown-paddleocr/src/markitdown_paddleocr/__about__.py new file mode 100644 index 000000000..d31c31eae --- /dev/null +++ b/packages/markitdown-paddleocr/src/markitdown_paddleocr/__about__.py @@ -0,0 +1 @@ +__version__ = "0.2.3" diff --git a/packages/markitdown-paddleocr/src/markitdown_paddleocr/__init__.py b/packages/markitdown-paddleocr/src/markitdown_paddleocr/__init__.py new file mode 100644 index 000000000..00b431621 --- /dev/null +++ b/packages/markitdown-paddleocr/src/markitdown_paddleocr/__init__.py @@ -0,0 +1,16 @@ +"""markitdown-paddleocr: PDF/Image to Markdown converter using PaddleOCR cloud API.""" + +from ._plugin import register_converters +from ._config import PaddleOcrConfig +from ._converter import PaddleOcrConverter +from ._paddle_client import PaddleClient +from ._dual_converter import DualOcrConverter + +__plugin_interface_version__ = 1 +__all__ = [ + "register_converters", + "PaddleOcrConfig", + "PaddleOcrConverter", + "PaddleClient", + "DualOcrConverter", +] diff --git a/packages/markitdown-paddleocr/src/markitdown_paddleocr/_config.py b/packages/markitdown-paddleocr/src/markitdown_paddleocr/_config.py new file mode 100644 index 000000000..e66bb21e6 --- /dev/null +++ b/packages/markitdown-paddleocr/src/markitdown_paddleocr/_config.py @@ -0,0 +1,65 @@ +"""Configuration for markitdown-paddleocr.""" + +import os +from dataclasses import dataclass +from enum import Enum + + +class ScanDetectionMode(str, Enum): + """Scan detection mode. + + - PAGE_BY_PAGE: analyze every page individually + - FIRST_PAGE_HINT: if the first page is a scan, run OCR on the whole document + - SAMPLING: sample the first N pages; if most are scans, run OCR on the whole document (default) + """ 
+ + PAGE_BY_PAGE = "page_by_page" + FIRST_PAGE_HINT = "first_page_hint" + SAMPLING = "sampling" + + +@dataclass +class PaddleOcrConfig: + """markitdown-paddleocr configuration. + + Configuration priority (high to low): + 1. Constructor kwargs + 2. Environment variables + 3. Built-in defaults + """ + + # API configuration + token: str = "" # Reads from BAIDU_PADDLE_TOKEN by default + + # OCR model + model: str = "PaddleOCR-VL-1.6" + + # API endpoint + job_url: str = "https://paddleocr.aistudio-app.com/api/v2/ocr/jobs" + + # Polling configuration + poll_interval: float = 2.0 # seconds between polls + poll_timeout: float = 300.0 # max seconds to wait for job completion + + # Optional OCR features + use_doc_orientation_classify: bool = False + use_doc_unwarping: bool = False + use_chart_recognition: bool = False + + # Processing strategy + force_ai: bool = False + + # Scan detection mode for optimization + scan_detection_mode: ScanDetectionMode = ScanDetectionMode.SAMPLING + scan_sample_pages: int = 3 # Number of pages to sample in SAMPLING mode + scan_text_threshold: int = 50 # Min text length to consider page as non-scanned + + @classmethod + def from_env(cls, **overrides) -> "PaddleOcrConfig": + """Create config from environment variables with optional overrides.""" + defaults = { + "token": os.environ.get("BAIDU_PADDLE_TOKEN", ""), + "model": os.environ.get("PADDLE_OCR_MODEL", "PaddleOCR-VL-1.6"), + } + defaults.update(overrides) + return cls(**defaults) diff --git a/packages/markitdown-paddleocr/src/markitdown_paddleocr/_converter.py b/packages/markitdown-paddleocr/src/markitdown_paddleocr/_converter.py new file mode 100644 index 000000000..6a11b8c85 --- /dev/null +++ b/packages/markitdown-paddleocr/src/markitdown_paddleocr/_converter.py @@ -0,0 +1,574 @@ +"""PaddleOcr Converter - PDF/Image to Markdown using PaddleOCR cloud API.""" + +import io +import logging +import sys +from typing import Any, BinaryIO, Optional + +from markitdown import DocumentConverter, 
DocumentConverterResult, StreamInfo +from markitdown._exceptions import ( + MISSING_DEPENDENCY_MESSAGE, + MissingDependencyException, +) + +from ._config import PaddleOcrConfig, ScanDetectionMode +from ._paddle_client import PaddleClient + +# Import PDF dependencies +_dependency_exc_info = None +try: + import pdfminer + import pdfminer.high_level + import pdfplumber +except ImportError: + _dependency_exc_info = sys.exc_info() + + +ACCEPTED_MIME_TYPE_PREFIXES = [ + "application/pdf", + "application/x-pdf", + "image/jpeg", + "image/png", +] + +ACCEPTED_FILE_EXTENSIONS = [".pdf", ".jpg", ".jpeg", ".png"] + + +logger = logging.getLogger(__name__) + + +class PaddleOcrConverter(DocumentConverter): + """Intelligent PDF/Image converter using PaddleOCR cloud API. + + Features: + - Auto-detect page content type (plain text vs images/tables) + - Plain text pages use pdfplumber/pdfminer (fast, free) + - Complex pages use PaddleOCR API for AI-powered OCR + - Image files (PNG, JPG) use PaddleOCR API directly + - Asynchronous job model: submit → poll → fetch result + """ + + def __init__( + self, + token: Optional[str] = None, + model: str = "PaddleOCR-VL-1.6", + poll_interval: float = 2.0, + poll_timeout: float = 300.0, + force_ai: bool = False, + use_doc_orientation_classify: bool = False, + use_doc_unwarping: bool = False, + use_chart_recognition: bool = False, + scan_detection_mode: Optional[ScanDetectionMode] = None, + scan_sample_pages: Optional[int] = None, + scan_text_threshold: Optional[int] = None, + config: Optional[PaddleOcrConfig] = None, + ): + """Initialize converter. 
+ + Args: + token: Baidu PaddleOCR token (reads from BAIDU_PADDLE_TOKEN env var if not provided) + model: OCR model name (default: PaddleOCR-VL-1.6) + poll_interval: Seconds between status polls (default: 2.0) + poll_timeout: Max seconds to wait for job completion (default: 300.0) + force_ai: Force all pages to use OCR (default: False) + use_doc_orientation_classify: Enable document orientation classification + use_doc_unwarping: Enable document unwarping + use_chart_recognition: Enable chart recognition + scan_detection_mode: Scan detection mode used to optimize scanned-PDF handling (default: SAMPLING) + scan_sample_pages: Number of pages sampled in SAMPLING mode (default: 3) + scan_text_threshold: Minimum extracted-text length for a page to count as non-scanned (default: 50) + config: Optional PaddleOcrConfig instance + """ + # Build config from explicit params or provided config + if config: + self.token = token or config.token + self.model = model if model != "PaddleOCR-VL-1.6" else config.model + self.poll_interval = ( + poll_interval if poll_interval != 2.0 else config.poll_interval + ) + self.poll_timeout = ( + poll_timeout if poll_timeout != 300.0 else config.poll_timeout + ) + self.force_ai = force_ai or config.force_ai + self.use_doc_orientation_classify = ( + use_doc_orientation_classify or config.use_doc_orientation_classify + ) + self.use_doc_unwarping = use_doc_unwarping or config.use_doc_unwarping + self.use_chart_recognition = ( + use_chart_recognition or config.use_chart_recognition + ) + self.scan_detection_mode = ( + scan_detection_mode + if scan_detection_mode is not None + else config.scan_detection_mode + ) + self.scan_sample_pages = ( + scan_sample_pages + if scan_sample_pages is not None + else config.scan_sample_pages + ) + self.scan_text_threshold = ( + scan_text_threshold + if scan_text_threshold is not None + else config.scan_text_threshold + ) + else: + self.token = token + self.model = model + self.poll_interval = poll_interval + self.poll_timeout = poll_timeout + self.force_ai = force_ai + self.use_doc_orientation_classify = use_doc_orientation_classify + 
self.use_doc_unwarping = use_doc_unwarping + self.use_chart_recognition = use_chart_recognition + self.scan_detection_mode = ( + scan_detection_mode + if scan_detection_mode is not None + else ScanDetectionMode.SAMPLING + ) + self.scan_sample_pages = ( + scan_sample_pages if scan_sample_pages is not None else 3 + ) + self.scan_text_threshold = ( + scan_text_threshold if scan_text_threshold is not None else 50 + ) + + # Lazy init client + self._client: Optional[PaddleClient] = None + + def _get_client(self) -> PaddleClient: + """Get or create PaddleClient instance.""" + if self._client is None: + config = PaddleOcrConfig( + token=self.token or "", + model=self.model, + poll_interval=self.poll_interval, + poll_timeout=self.poll_timeout, + force_ai=self.force_ai, + use_doc_orientation_classify=self.use_doc_orientation_classify, + use_doc_unwarping=self.use_doc_unwarping, + use_chart_recognition=self.use_chart_recognition, + ) + self._client = PaddleClient(config=config) + return self._client + + def _has_token(self) -> bool: + """Check if a valid token is available.""" + if self.token: + return True + import os + + return bool(os.environ.get("BAIDU_PADDLE_TOKEN", "")) + + def accepts( + self, + file_stream: BinaryIO, + stream_info: StreamInfo, + **kwargs: Any, + ) -> bool: + # Without a token, PaddleOCR API cannot work — decline so other + # converters (e.g. GlmOcrConverter) get a chance. 
+ if not self._has_token(): + return False + + mimetype = (stream_info.mimetype or "").lower() + extension = (stream_info.extension or "").lower() + + if extension in ACCEPTED_FILE_EXTENSIONS: + return True + + for prefix in ACCEPTED_MIME_TYPE_PREFIXES: + if mimetype.startswith(prefix): + return True + + return False + + def convert( + self, + file_stream: BinaryIO, + stream_info: StreamInfo, + **kwargs: Any, + ) -> DocumentConverterResult: + if _dependency_exc_info is not None: + raise MissingDependencyException( + MISSING_DEPENDENCY_MESSAGE.format( + converter=type(self).__name__, + extension=".pdf", + feature="pdf", + ) + ) from _dependency_exc_info[1].with_traceback(_dependency_exc_info[2]) + + extension = (stream_info.extension or "").lower() + + logger.info("PaddleOcrConverter: 开始转换, 文件类型=%s", extension) + + # Image files: use PaddleOCR directly + if extension in (".jpg", ".jpeg", ".png"): + return self._convert_image(file_stream, extension) + + # PDF files: use hybrid approach + return self._convert_pdf(file_stream) + + def _convert_image( + self, file_stream: BinaryIO, extension: str = ".png" + ) -> DocumentConverterResult: + """Convert image file using PaddleOCR API.""" + img_bytes = file_stream.read() + filename = f"image{extension}" + + logger.info("PaddleOcrConverter: 开始 OCR 识别图片, 格式=%s", extension) + try: + markdown = self._get_client().ocr(file_bytes=img_bytes, filename=filename) + except Exception as e: + logger.error( + "PaddleOcrConverter: 图片 OCR 识别异常, 格式=%s, 错误=%s", extension, e + ) + raise + + logger.info("PaddleOcrConverter: 图片 OCR 识别完成, 输出长度=%d", len(markdown)) + return DocumentConverterResult(markdown=markdown) + + def _convert_pdf(self, file_stream: BinaryIO) -> DocumentConverterResult: + """Convert PDF using hybrid approach (pdfplumber for text, PaddleOCR for complex pages).""" + pdf_stream = io.BytesIO(file_stream.read()) + pdf_bytes = pdf_stream.getvalue() # Keep original bytes for batch OCR + markdown_parts = [] + ocr_failed = False + + 
try: + with pdfplumber.open(pdf_stream) as pdf: + total_pages = len(pdf.pages) + logger.info("PaddleOcrConverter: processing PDF, total pages=%d", total_pages) + + # Optimization: detect if entire PDF is scanned + all_scanned = self._detect_all_scanned(pdf) + + if all_scanned and not self.force_ai: + # Batch mode: upload entire PDF to OCR API (single API call) + logger.info( + "PaddleOcrConverter: fully scanned document, uploading whole PDF in batch, pages=%d", + total_pages, + ) + try: + markdown = self._convert_pdf_batch(pdf_bytes) + if markdown.strip(): + logger.info( + "PaddleOcrConverter: batch OCR finished, output length=%d", + len(markdown), + ) + return DocumentConverterResult(markdown=markdown) + except Exception as e: + logger.warning( + "PaddleOcrConverter: batch OCR failed, falling back to per-page processing, error=%s", + e, + ) + ocr_failed = True + # Fall through to per-page processing + + # Per-page processing (PAGE_BY_PAGE mode or batch failed) + for page_num, page in enumerate(pdf.pages): + # Choose processing method + if self.force_ai or all_scanned: + # All scanned (after batch failed) or force_ai + logger.info( + "PaddleOcrConverter: page %d/%d, using PaddleOCR", + page_num + 1, + total_pages, + ) + try: + markdown = self._convert_with_paddleocr(page, page_num) + except Exception as e: + logger.warning( + "PaddleOcrConverter: page %d/%d OCR failed, falling back to pdfplumber, error=%s", + page_num + 1, + total_pages, + e, + ) + ocr_failed = True + markdown = self._extract_text_with_tables(page) + else: + # Per-page analysis (PAGE_BY_PAGE mode or non-scanned doc) + page_type = self._analyze_page(page) + + if page_type != "plain_text": + logger.info( + "PaddleOcrConverter: page %d/%d, type=%s, using PaddleOCR", + page_num + 1, + total_pages, + page_type, + ) + try: + markdown = self._convert_with_paddleocr(page, page_num) + except Exception as e: + logger.warning( + "PaddleOcrConverter: page %d/%d OCR failed, falling back to pdfplumber, error=%s", + page_num + 1, + total_pages, + e, + ) + ocr_failed = True + markdown = self._extract_text_with_tables(page) + else: + logger.info( + "PaddleOcrConverter: page %d/%d, type=%s, using pdfplumber", + page_num + 1, + total_pages, + page_type, + ) + markdown = self._extract_text_with_tables(page) + + if markdown.strip(): + markdown_parts.append(f"## Page {page_num + 1}\n\n{markdown}") + + page.close() + + markdown = "\n\n".join(markdown_parts).strip() + + except Exception as e: + logger.error( + "PaddleOcrConverter: PDF processing failed, falling back to pdfminer, error=%s", e + ) + # Fallback to pdfminer + pdf_stream.seek(0) + markdown = pdfminer.high_level.extract_text(pdf_stream) or "" + + # Final fallback + if not markdown: + pdf_stream.seek(0) + markdown = pdfminer.high_level.extract_text(pdf_stream) or "" + + # If OCR failed and result is empty, raise so the framework can try + # the next converter (e.g. GlmOcrConverter) instead of returning empty. + if ocr_failed and not markdown.strip(): + logger.error("PaddleOcrConverter: OCR failed and every fallback returned empty output, raising") + raise RuntimeError( + "PaddleOcrConverter: OCR failed and all fallbacks returned empty" + ) + + logger.info("PaddleOcrConverter: PDF conversion finished, output length=%d", len(markdown)) + return DocumentConverterResult(markdown=markdown) + + def _convert_pdf_batch(self, pdf_bytes: bytes) -> str: + """Convert entire PDF in a single API call. + + More efficient for scanned PDFs: one API call instead of N calls for N pages. + + Args: + pdf_bytes: Raw PDF file content. + + Returns: + Markdown text from all pages. 
+ """ + logger.info( + "PaddleOcrConverter: uploading PDF to OCR API in batch, size=%d bytes", len(pdf_bytes) + ) + markdown = self._get_client().ocr( + file_bytes=pdf_bytes, + filename="document.pdf", + ) + return markdown + + def _analyze_page(self, page: Any) -> str: + """Analyze page content type.""" + # Check for images + if hasattr(page, "images") and page.images: + return "complex" + + # Check for tables + tables = page.find_tables() + if tables: + return "complex" + + # Check for graphics/curves + if hasattr(page, "curves") and page.curves: + return "complex" + + return "plain_text" + + def _is_scanned_page(self, page: Any) -> bool: + """Check if a page is likely a scanned image. + + A page is considered scanned if: + 1. It contains images, AND + 2. It has very little extractable text (below threshold) + + Args: + page: pdfplumber page object + + Returns: + True if the page appears to be a scanned image + """ + # Must have images to be a scan + has_images = hasattr(page, "images") and bool(page.images) + if not has_images: + return False + + # Check extractable text length + try: + text = page.extract_text() or "" + text_len = len(text.strip()) + # If there's substantial text, it might be a mixed page or + # a digital PDF with embedded images + if text_len >= self.scan_text_threshold: + return False + except Exception: + # If text extraction fails, assume it's a scan + return True + + return True + + def _detect_all_scanned(self, pdf: Any) -> bool: + """Detect if entire PDF is scanned based on scan_detection_mode. + + Optimization: When first few pages are scanned, we can assume + all pages are scanned and skip per-page analysis. 
+ + Args: + pdf: pdfplumber PDF object + + Returns: + True if entire PDF should be treated as scanned + """ + if self.scan_detection_mode == ScanDetectionMode.PAGE_BY_PAGE: + return False + + total_pages = len(pdf.pages) + if total_pages == 0: + return False + + if self.scan_detection_mode == ScanDetectionMode.FIRST_PAGE_HINT: + # Check only first page + first_page = pdf.pages[0] + is_scanned = self._is_scanned_page(first_page) + first_page.close() + if is_scanned: + logger.info( + "PaddleOcrConverter: first page detected as scanned, mode=FIRST_PAGE_HINT, using OCR for the whole document" + ) + return is_scanned + + if self.scan_detection_mode == ScanDetectionMode.SAMPLING: + # Sample first N pages + sample_count = min(self.scan_sample_pages, total_pages) + scanned_count = 0 + + for i in range(sample_count): + page = pdf.pages[i] + if self._is_scanned_page(page): + scanned_count += 1 + + # If majority of sampled pages are scanned, treat all as scanned + majority_threshold = sample_count // 2 + 1 + all_scanned = scanned_count >= majority_threshold + + if all_scanned: + logger.info( + "PaddleOcrConverter: sampling found %d/%d pages scanned, mode=SAMPLING, using OCR for the whole document", + scanned_count, + sample_count, + ) + + return all_scanned + + return False + + def _convert_with_paddleocr(self, page: Any, page_num: int) -> str: + """Convert page using PaddleOCR API.""" + # Render page to image + img = page.to_image(resolution=150) + img_bytes = io.BytesIO() + img.save(img_bytes, format="PNG") + + logger.info("PaddleOcrConverter: PaddleOCR API starting recognition of page %d", page_num + 1) + try: + markdown = self._get_client().ocr( + file_bytes=img_bytes.getvalue(), + filename=f"page_{page_num + 1}.png", + ) + except Exception as e: + logger.error( + "PaddleOcrConverter: PaddleOCR API failed on page %d, error=%s", + page_num + 1, + e, + ) + raise + + logger.info( + "PaddleOcrConverter: PaddleOCR API finished page %d, output length=%d", + page_num + 1, + len(markdown), + ) + return markdown + + def _extract_text_with_tables(self, page: Any) -> str: + """Extract text and tables from page.""" + 
parts = [] + + # Extract text + text = page.extract_text() or "" + if text.strip(): + parts.append(text.strip()) + + # Extract tables + try: + tables = page.extract_tables() + if tables: + for table in tables: + if table: + md_table = self._table_to_markdown(table) + if md_table.strip(): + parts.append(md_table) + except Exception: + pass + + return "\n\n".join(parts) + + def _table_to_markdown(self, table: list[list[str]]) -> str: + """Convert table to Markdown.""" + if not table: + return "" + + # Filter None values + table = [[cell if cell is not None else "" for cell in row] for row in table] + + # Filter empty rows + table = [row for row in table if any(cell.strip() for cell in row)] + + if not table: + return "" + + # Calculate column widths + col_widths = [ + max(len(str(row[i])) if i < len(row) else 0 for row in table) + for i in range(max(len(row) for row in table)) + ] + + # Format table + lines = [] + for row_idx, row in enumerate(table): + padded_row = row + [""] * (len(col_widths) - len(row)) + line = ( + "| " + + " | ".join( + str(cell).ljust(width) + for cell, width in zip(padded_row, col_widths) + ) + + " |" + ) + lines.append(line) + + if row_idx == 0: + sep = "|" + "|".join("-" * (w + 2) for w in col_widths) + "|" + lines.append(sep) + + return "\n".join(lines) + + def close(self): + """Close the client.""" + self._client = None + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc_val, exc_tb): + self.close() diff --git a/packages/markitdown-paddleocr/src/markitdown_paddleocr/_dual_converter.py b/packages/markitdown-paddleocr/src/markitdown_paddleocr/_dual_converter.py new file mode 100644 index 000000000..0957b9b87 --- /dev/null +++ b/packages/markitdown-paddleocr/src/markitdown_paddleocr/_dual_converter.py @@ -0,0 +1,167 @@ +"""DualOcrConverter - glmocr (primary) → paddleocr (fallback) automatic degradation.""" + +import logging +from typing import Any, BinaryIO, Optional + +from markitdown import ( + DocumentConverter, + 
DocumentConverterResult, + MarkItDown, + StreamInfo, + ) + + logger = logging.getLogger(__name__) + + + class DualOcrConverter(DocumentConverter): + """Dual OCR converter with automatic fallback: glmocr → paddleocr. + + Usage: + converter = DualOcrConverter() + md = MarkItDown(enable_plugins=False) + md.register_converter(converter, priority=-1.0) + result = md.convert("document.pdf") + """ + + def __init__( + self, + # glmocr kwargs + glmocr_api_key: Optional[str] = None, + glmocr_timeout: int = 1800, + glmocr_enable_layout: bool = False, + glmocr_force_ai: bool = False, + # paddleocr kwargs + paddleocr_token: Optional[str] = None, + paddleocr_model: str = "PaddleOCR-VL-1.6", + paddleocr_poll_interval: float = 2.0, + paddleocr_poll_timeout: float = 300.0, + paddleocr_force_ai: bool = False, + paddleocr_use_doc_orientation_classify: bool = False, + paddleocr_use_doc_unwarping: bool = False, + paddleocr_use_chart_recognition: bool = False, + ): + self.glmocr_kwargs = { + "api_key": glmocr_api_key, + "timeout": glmocr_timeout, + "enable_layout": glmocr_enable_layout, + "force_ai": glmocr_force_ai, + } + self.paddleocr_kwargs = { + "token": paddleocr_token, + "model": paddleocr_model, + "poll_interval": paddleocr_poll_interval, + "poll_timeout": paddleocr_poll_timeout, + "force_ai": paddleocr_force_ai, + "use_doc_orientation_classify": paddleocr_use_doc_orientation_classify, + "use_doc_unwarping": paddleocr_use_doc_unwarping, + "use_chart_recognition": paddleocr_use_chart_recognition, + } + + self._primary = None + self._fallback = None + self._init_converters() + + def _init_converters(self): + """Initialize both converters up front; one that fails to import or construct stays None.""" + try: + from markitdown_glmocr import GlmOcrConverter + + # Drop None values so the converter falls back to its own defaults + kwargs = {k: v for k, v in self.glmocr_kwargs.items() if v is not None} + self._primary = GlmOcrConverter(**kwargs) + logger.info("glmocr converter initialized (primary)") + except Exception as e: + logger.warning("glmocr init failed: %s", e) + self._primary = 
None + + try: + from markitdown_paddleocr import PaddleOcrConverter + + kwargs = {k: v for k, v in self.paddleocr_kwargs.items() if v is not None} + self._fallback = PaddleOcrConverter(**kwargs) + logger.info("paddleocr converter initialized (fallback)") + except Exception as e: + logger.warning("paddleocr init failed: %s", e) + self._fallback = None + + def accepts( + self, + file_stream: BinaryIO, + stream_info: StreamInfo, + **kwargs: Any, + ) -> bool: + """Accept if either converter accepts.""" + if self._primary: + try: + file_stream.seek(0) + if self._primary.accepts(file_stream, stream_info, **kwargs): + return True + except Exception: + pass + + if self._fallback: + try: + file_stream.seek(0) + if self._fallback.accepts(file_stream, stream_info, **kwargs): + return True + except Exception: + pass + + return False + + def convert( + self, + file_stream: BinaryIO, + stream_info: StreamInfo, + **kwargs: Any, + ) -> DocumentConverterResult: + """Convert with primary, fallback on failure.""" + data = file_stream.read() + + # Try primary (glmocr) + if self._primary: + try: + result = self._primary.convert(io_bytes(data), stream_info, **kwargs) + if result.markdown and result.markdown.strip(): + logger.info("✓ glmocr succeeded") + return result + logger.warning("glmocr returned empty result, falling back") + except Exception as e: + logger.warning("glmocr failed: %s, falling back to paddleocr", e) + + # Fallback (paddleocr) + if self._fallback: + try: + result = self._fallback.convert(io_bytes(data), stream_info, **kwargs) + if result.markdown and result.markdown.strip(): + logger.info("✓ paddleocr succeeded (fallback)") + return result + logger.warning("paddleocr returned empty result") + except Exception as e: + logger.error("paddleocr also failed: %s", e) + + # Both failed + return DocumentConverterResult( + markdown="" + ) + + def close(self): + if self._primary and hasattr(self._primary, "close"): + self._primary.close() + if self._fallback and 
hasattr(self._fallback, "close"): + self._fallback.close() + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc_val, exc_tb): + self.close() + + +def io_bytes(data: bytes): + """Create a seekable BytesIO from bytes.""" + import io + + buf = io.BytesIO(data) + buf.seek(0) + return buf diff --git a/packages/markitdown-paddleocr/src/markitdown_paddleocr/_paddle_client.py b/packages/markitdown-paddleocr/src/markitdown_paddleocr/_paddle_client.py new file mode 100644 index 000000000..ba12e51c9 --- /dev/null +++ b/packages/markitdown-paddleocr/src/markitdown_paddleocr/_paddle_client.py @@ -0,0 +1,189 @@ +"""PaddleOCR API Client - handles job submission, polling, and result fetching.""" + +import json +import logging +import time +from typing import Optional + +import requests + +from ._config import PaddleOcrConfig + +logger = logging.getLogger(__name__) + + +class PaddleOcrError(Exception): + """PaddleOCR API error.""" + + pass + + +class PaddleClient: + """Client for PaddleOCR cloud API. + + Workflow: submit job → poll status → fetch JSONL result → extract markdown. 
+ """ + + def __init__(self, config: Optional[PaddleOcrConfig] = None, **kwargs): + if config is None: + config = PaddleOcrConfig(**kwargs) + self.config = config + + # Token from config or env + self.token = config.token + if not self.token: + import os + self.token = os.environ.get("BAIDU_PADDLE_TOKEN", "") + + def _headers(self) -> dict: + """Build authorization headers.""" + return {"Authorization": f"bearer {self.token}"} + + def _optional_payload(self) -> dict: + """Build optional payload flags.""" + return { + "useDocOrientationClassify": self.config.use_doc_orientation_classify, + "useDocUnwarping": self.config.use_doc_unwarping, + "useChartRecognition": self.config.use_chart_recognition, + } + + def ocr( + self, + file_bytes: Optional[bytes] = None, + filename: Optional[str] = None, + file_url: Optional[str] = None, + ) -> str: + """Run OCR on a file or URL, return concatenated markdown. + + Args: + file_bytes: File content bytes (for local file upload). + filename: Filename for multipart upload (e.g. "page.png"). + file_url: File URL (for URL mode, alternative to file_bytes). + + Returns: + Markdown text extracted from all pages. + + Raises: + PaddleOcrError: On API errors or timeout. + """ + # 1. Submit job + job_id = self._submit(file_bytes=file_bytes, filename=filename, file_url=file_url) + logger.info("Job submitted: %s", job_id) + + # 2. Poll until done + result_url = self._poll(job_id) + logger.info("Job completed, result URL obtained") + + # 3. 
Fetch and parse results + return self._fetch_markdown(result_url) + + def _submit( + self, + file_bytes: Optional[bytes] = None, + filename: Optional[str] = None, + file_url: Optional[str] = None, + ) -> str: + """Submit an OCR job, return job ID.""" + headers = self._headers() + + if file_url: + # URL mode + headers["Content-Type"] = "application/json" + payload = { + "fileUrl": file_url, + "model": self.config.model, + "optionalPayload": self._optional_payload(), + } + resp = requests.post(self.config.job_url, json=payload, headers=headers) + elif file_bytes is not None: + # Local file mode - multipart upload + data = { + "model": self.config.model, + "optionalPayload": json.dumps(self._optional_payload()), + } + fname = filename or "document" + files = {"file": (fname, file_bytes)} + resp = requests.post(self.config.job_url, headers=headers, data=data, files=files) + else: + raise PaddleOcrError("Either file_bytes or file_url must be provided") + + if resp.status_code != 200: + raise PaddleOcrError(f"Submit failed (HTTP {resp.status_code}): {resp.text}") + + result = resp.json() + job_id = result.get("data", {}).get("jobId") + if not job_id: + raise PaddleOcrError(f"No jobId in response: {result}") + + return job_id + + def _poll(self, job_id: str) -> str: + """Poll job status until done, return JSONL result URL.""" + headers = self._headers() + url = f"{self.config.job_url}/{job_id}" + start = time.time() + + while True: + resp = requests.get(url, headers=headers) + if resp.status_code != 200: + raise PaddleOcrError(f"Poll failed (HTTP {resp.status_code}): {resp.text}") + + data = resp.json().get("data", {}) + state = data.get("state", "") + + if state == "done": + result_url = data.get("resultUrl", {}).get("jsonUrl", "") + if not result_url: + raise PaddleOcrError("Job done but no resultUrl in response") + return result_url + + if state == "failed": + error_msg = data.get("errorMsg", "Unknown error") + raise PaddleOcrError(f"Job failed: {error_msg}") + + # 
Still pending or running + if state == "running": + progress = data.get("extractProgress", {}) + total = progress.get("totalPages", "?") + extracted = progress.get("extractedPages", "?") + logger.debug("Running: %s/%s pages", extracted, total) + else: + logger.debug("State: %s", state) + + # Check timeout + elapsed = time.time() - start + if elapsed > self.config.poll_timeout: + raise PaddleOcrError( + f"Job polling timed out after {self.config.poll_timeout}s (state={state})" + ) + + time.sleep(self.config.poll_interval) + + def _fetch_markdown(self, jsonl_url: str) -> str: + """Fetch JSONL result and extract markdown from all pages.""" + resp = requests.get(jsonl_url) + resp.raise_for_status() + + markdown_parts = [] + lines = resp.text.strip().split("\n") + + for line in lines: + line = line.strip() + if not line: + continue + + try: + page_data = json.loads(line) + except json.JSONDecodeError: + logger.warning("Skipping invalid JSONL line") + continue + + result = page_data.get("result", {}) + layout_results = result.get("layoutParsingResults", []) + + for layout in layout_results: + md_text = layout.get("markdown", {}).get("text", "") + if md_text.strip(): + markdown_parts.append(md_text.strip()) + + return "\n\n".join(markdown_parts) diff --git a/packages/markitdown-paddleocr/src/markitdown_paddleocr/_plugin.py b/packages/markitdown-paddleocr/src/markitdown_paddleocr/_plugin.py new file mode 100644 index 000000000..e84e70bb8 --- /dev/null +++ b/packages/markitdown-paddleocr/src/markitdown_paddleocr/_plugin.py @@ -0,0 +1,50 @@ +"""Plugin registration for markitdown-paddleocr.""" + +import logging +from typing import Any + +from markitdown import MarkItDown + +from ._converter import PaddleOcrConverter + +__plugin_interface_version__ = 1 + +logger = logging.getLogger(__name__) + + +def register_converters(markitdown: MarkItDown, **kwargs: Any) -> None: + """Register markitdown-paddleocr converter. + + Config sources (priority high to low): + 1. 
kwargs parameters + 2. Environment variables (BAIDU_PADDLE_TOKEN) + 3. Built-in defaults + """ + logger.info("markitdown-paddleocr: registering plugin") + + # Register converter ahead of the default PDF converter (lower values are tried first) + PRIORITY_PADDLEOCR = -1.0 + + try: + converter = PaddleOcrConverter( + token=kwargs.get("token"), + model=kwargs.get("model", "PaddleOCR-VL-1.6"), + poll_interval=kwargs.get("poll_interval", 2.0), + poll_timeout=kwargs.get("poll_timeout", 300.0), + force_ai=kwargs.get("force_ai", False), + use_doc_orientation_classify=kwargs.get( + "use_doc_orientation_classify", False + ), + use_doc_unwarping=kwargs.get("use_doc_unwarping", False), + use_chart_recognition=kwargs.get("use_chart_recognition", False), + ) + markitdown.register_converter( + converter, + priority=PRIORITY_PADDLEOCR, + ) + logger.info( + "markitdown-paddleocr: plugin registered, priority=%.1f", PRIORITY_PADDLEOCR + ) + except Exception as e: + logger.error("markitdown-paddleocr: plugin registration failed, error=%s", e) + raise diff --git a/packages/markitdown-paddleocr/tests/__init__.py b/packages/markitdown-paddleocr/tests/__init__.py new file mode 100644 index 000000000..4be5c24f3 --- /dev/null +++ b/packages/markitdown-paddleocr/tests/__init__.py @@ -0,0 +1 @@ +"""Tests for markitdown-paddleocr.""" diff --git a/packages/markitdown-paddleocr/tests/test_converter.py b/packages/markitdown-paddleocr/tests/test_converter.py new file mode 100644 index 000000000..0e569dc94 --- /dev/null +++ b/packages/markitdown-paddleocr/tests/test_converter.py @@ -0,0 +1,220 @@ +"""Tests for PaddleOcrConverter.""" + +import io +import pytest +from unittest.mock import MagicMock, patch + +from markitdown_paddleocr._converter import PaddleOcrConverter + + +class TestPaddleOcrConverterAccepts: + """Accepts method tests.""" + + def test_accepts_pdf_extension_with_token(self): + """Accept .pdf extension when token is available.""" + converter = PaddleOcrConverter(token="test-token") + stream = io.BytesIO(b"%PDF-1.4") + stream_info = 
MagicMock(extension=".pdf", mimetype=None) + assert converter.accepts(stream, stream_info) is True + + def test_accepts_pdf_mimetype_with_token(self): + """Accept PDF MIME type when token is available.""" + converter = PaddleOcrConverter(token="test-token") + stream = io.BytesIO(b"%PDF-1.4") + stream_info = MagicMock(extension=None, mimetype="application/pdf") + assert converter.accepts(stream, stream_info) is True + + def test_accepts_image_extensions_with_token(self): + """Accept image extensions when token is available.""" + converter = PaddleOcrConverter(token="test-token") + for ext in [".jpg", ".jpeg", ".png"]: + stream = io.BytesIO(b"fake") + stream_info = MagicMock(extension=ext, mimetype=None) + assert converter.accepts(stream, stream_info) is True + + def test_rejects_without_token(self): + """Reject all files when no token is available.""" + converter = PaddleOcrConverter() # no token + stream = io.BytesIO(b"%PDF-1.4") + stream_info = MagicMock(extension=".pdf", mimetype="application/pdf") + assert converter.accepts(stream, stream_info) is False + + def test_rejects_non_supported(self): + """Reject unsupported files even when a token is available.""" + converter = PaddleOcrConverter(token="test-token") + stream = io.BytesIO(b"not a pdf") + stream_info = MagicMock(extension=".txt", mimetype="text/plain") + assert converter.accepts(stream, stream_info) is False + + + class TestPaddleOcrConverterTable: + """Table to Markdown conversion tests.""" + + def test_table_to_markdown(self): + """Table to Markdown conversion.""" + converter = PaddleOcrConverter() + table = [ + ["Name", "Age", "City"], + ["Alice", "25", "Beijing"], + ["Bob", "30", "Shanghai"], + ] + result = converter._table_to_markdown(table) + assert "|" in result + assert "Name" in result + assert "Alice" in result + assert "---" in result + + def test_empty_table(self): + """Empty table returns empty string.""" + converter = PaddleOcrConverter() + assert converter._table_to_markdown([]) == "" + + def test_table_with_none_values(self): + 
"""Table with None values.""" + converter = PaddleOcrConverter() + table = [ + ["A", None, "C"], + ["1", "2", None], + ] + result = converter._table_to_markdown(table) + assert "|" in result + assert "A" in result + + +class TestPaddleOcrConverterImage: + """Image conversion tests.""" + + def test_convert_image_success(self): + """Convert image with PaddleOCR success.""" + converter = PaddleOcrConverter(token="test-token") + + mock_client = MagicMock() + mock_client.ocr.return_value = "# Image Title\n\nContent" + converter._client = mock_client + + stream = io.BytesIO(b"fake-image") + stream_info = MagicMock(extension=".png", mimetype="image/png") + result = converter.convert(stream, stream_info) + + assert "# Image Title" in result.markdown + mock_client.ocr.assert_called_once() + + def test_convert_image_error_raises(self): + """Convert image with PaddleOCR error raises exception (for framework fallback).""" + converter = PaddleOcrConverter(token="test-token") + + mock_client = MagicMock() + mock_client.ocr.side_effect = Exception("API Error") + converter._client = mock_client + + stream = io.BytesIO(b"fake-image") + stream_info = MagicMock(extension=".png", mimetype="image/png") + with pytest.raises(Exception, match="API Error"): + converter.convert(stream, stream_info) + + +class TestPaddleOcrConverterPdf: + """PDF conversion tests.""" + + def test_plain_text_page(self): + """Plain text page uses pdfplumber.""" + converter = PaddleOcrConverter() + + page = MagicMock() + page.images = [] + page.find_tables.return_value = [] + page.extract_tables.return_value = [] + page.extract_text.return_value = "Hello World" + page.close = MagicMock() + + mock_pdf = MagicMock() + mock_pdf.pages = [page] + + with patch("markitdown_paddleocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = mock_pdf + stream = io.BytesIO(b"%PDF-1.4") + result = converter.convert(stream, MagicMock(extension=".pdf", mimetype=None)) + + assert "Hello 
World" in result.markdown + + def test_complex_page_uses_paddleocr(self): + """Complex page uses PaddleOCR.""" + converter = PaddleOcrConverter(token="test-token") + + mock_client = MagicMock() + mock_client.ocr.return_value = "OCR result for complex page" + converter._client = mock_client + + page = MagicMock() + page.images = [MagicMock()] + page.find_tables.return_value = [] + page.to_image.return_value.save = MagicMock( + side_effect=lambda buf, format: buf.write(b"fake-png") + ) + page.close = MagicMock() + + mock_pdf = MagicMock() + mock_pdf.pages = [page] + + with patch("markitdown_paddleocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = mock_pdf + stream = io.BytesIO(b"%PDF-1.4") + result = converter.convert(stream, MagicMock(extension=".pdf", mimetype=None)) + + mock_client.ocr.assert_called_once() + assert "OCR result" in result.markdown + + def test_force_ai_mode(self): + """Force AI mode uses PaddleOCR for all pages.""" + converter = PaddleOcrConverter(token="test-token", force_ai=True) + + mock_client = MagicMock() + mock_client.ocr.return_value = "AI result" + converter._client = mock_client + + page = MagicMock() + page.images = [] + page.find_tables.return_value = [] + page.to_image.return_value.save = MagicMock( + side_effect=lambda buf, format: buf.write(b"fake-png") + ) + page.close = MagicMock() + + mock_pdf = MagicMock() + mock_pdf.pages = [page] + + with patch("markitdown_paddleocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = mock_pdf + stream = io.BytesIO(b"%PDF-1.4") + result = converter.convert(stream, MagicMock(extension=".pdf", mimetype=None)) + + mock_client.ocr.assert_called_once() + + +class TestPaddleOcrConverterConfig: + """Config initialization tests.""" + + def test_default_config(self): + """Default configuration values.""" + converter = PaddleOcrConverter() + assert converter.model == "PaddleOCR-VL-1.5" + assert converter.poll_interval 
== 2.0 + assert converter.poll_timeout == 300.0 + assert converter.force_ai is False + + def test_custom_config(self): + """Custom configuration values.""" + converter = PaddleOcrConverter( + token="my-token", + model="custom-model", + poll_interval=5.0, + poll_timeout=600.0, + force_ai=True, + use_chart_recognition=True, + ) + assert converter.token == "my-token" + assert converter.model == "custom-model" + assert converter.poll_interval == 5.0 + assert converter.poll_timeout == 600.0 + assert converter.force_ai is True + assert converter.use_chart_recognition is True diff --git a/packages/markitdown-paddleocr/tests/test_paddle_client.py b/packages/markitdown-paddleocr/tests/test_paddle_client.py new file mode 100644 index 000000000..361a329b6 --- /dev/null +++ b/packages/markitdown-paddleocr/tests/test_paddle_client.py @@ -0,0 +1,241 @@ +"""Tests for PaddleClient.""" + +import json +import pytest +from unittest.mock import MagicMock, patch + +from markitdown_paddleocr._paddle_client import PaddleClient, PaddleOcrError +from markitdown_paddleocr._config import PaddleOcrConfig + + +class TestPaddleClientInit: + """Client initialization tests.""" + + def test_init_with_token(self): + """Init with explicit token.""" + client = PaddleClient(token="test-token") + assert client.token == "test-token" + + @patch.dict("os.environ", {"BAIDU_PADDLE_TOKEN": "env-token"}) + def test_init_from_env(self): + """Init from environment variable.""" + client = PaddleClient() + assert client.token == "env-token" + + def test_init_with_config(self): + """Init with PaddleOcrConfig.""" + config = PaddleOcrConfig(token="config-token", model="custom-model") + client = PaddleClient(config=config) + assert client.token == "config-token" + assert client.config.model == "custom-model" + + +class TestPaddleClientSubmit: + """Job submission tests.""" + + def test_submit_local_file(self): + """Submit local file via multipart upload.""" + client = PaddleClient(token="test-token") + + mock_response 
= MagicMock() + mock_response.status_code = 200 + mock_response.json.return_value = {"data": {"jobId": "job-123"}} + + with patch("requests.post", return_value=mock_response) as mock_post: + job_id = client._submit(file_bytes=b"fake-image", filename="test.png") + + assert job_id == "job-123" + # Verify multipart upload was used (files keyword argument) + call_kwargs = mock_post.call_args.kwargs + assert "files" in call_kwargs + + def test_submit_url_mode(self): + """Submit file URL via JSON.""" + client = PaddleClient(token="test-token") + + mock_response = MagicMock() + mock_response.status_code = 200 + mock_response.json.return_value = {"data": {"jobId": "job-456"}} + + with patch("requests.post", return_value=mock_response) as mock_post: + job_id = client._submit(file_url="https://example.com/doc.pdf") + + assert job_id == "job-456" + + def test_submit_error(self): + """Submit with API error.""" + client = PaddleClient(token="test-token") + + mock_response = MagicMock() + mock_response.status_code = 500 + mock_response.text = "Internal Server Error" + + with patch("requests.post", return_value=mock_response): + with pytest.raises(PaddleOcrError, match="Submit failed"): + client._submit(file_bytes=b"fake", filename="test.png") + + def test_submit_no_input(self): + """Submit without file or URL raises error.""" + client = PaddleClient(token="test-token") + with pytest.raises(PaddleOcrError, match="Either file_bytes or file_url"): + client._submit() + + +class TestPaddleClientPoll: + """Job polling tests.""" + + def test_poll_done_immediately(self): + """Job is done on first poll.""" + client = PaddleClient(token="test-token") + + mock_response = MagicMock() + mock_response.status_code = 200 + mock_response.json.return_value = { + "data": { + "state": "done", + "resultUrl": {"jsonUrl": "https://result.url/data.jsonl"}, + } + } + + with patch("requests.get", return_value=mock_response): + result_url = client._poll("job-123") + + assert 
result_url == "https://result.url/data.jsonl" + + def test_poll_failed(self): + """Job fails.""" + client = PaddleClient(token="test-token") + + mock_response = MagicMock() + mock_response.status_code = 200 + mock_response.json.return_value = { + "data": {"state": "failed", "errorMsg": "Processing error"} + } + + with patch("requests.get", return_value=mock_response): + with pytest.raises(PaddleOcrError, match="Job failed"): + client._poll("job-123") + + def test_poll_timeout(self): + """Polling timeout.""" + config = PaddleOcrConfig(token="test-token", poll_interval=0.01, poll_timeout=0.05) + client = PaddleClient(config=config) + + mock_response = MagicMock() + mock_response.status_code = 200 + mock_response.json.return_value = {"data": {"state": "pending"}} + + with patch("requests.get", return_value=mock_response): + with pytest.raises(PaddleOcrError, match="timed out"): + client._poll("job-123") + + +class TestPaddleClientFetchMarkdown: + """Result fetching tests.""" + + def test_fetch_single_page(self): + """Fetch single page result.""" + client = PaddleClient(token="test-token") + + jsonl_content = json.dumps({ + "result": { + "layoutParsingResults": [ + {"markdown": {"text": "# Title\n\nHello world"}} + ] + } + }) + + mock_response = MagicMock() + mock_response.text = jsonl_content + mock_response.raise_for_status = MagicMock() + + with patch("requests.get", return_value=mock_response): + markdown = client._fetch_markdown("https://result.url/data.jsonl") + + assert "# Title" in markdown + assert "Hello world" in markdown + + def test_fetch_multi_page(self): + """Fetch multi-page result.""" + client = PaddleClient(token="test-token") + + page1 = json.dumps({ + "result": { + "layoutParsingResults": [ + {"markdown": {"text": "Page 1 content"}} + ] + } + }) + page2 = json.dumps({ + "result": { + "layoutParsingResults": [ + {"markdown": {"text": "Page 2 content"}} + ] + } + }) + jsonl_content = f"{page1}\n{page2}" + + mock_response = MagicMock() + 
mock_response.text = jsonl_content + mock_response.raise_for_status = MagicMock() + + with patch("requests.get", return_value=mock_response): + markdown = client._fetch_markdown("https://result.url/data.jsonl") + + assert "Page 1 content" in markdown + assert "Page 2 content" in markdown + + def test_fetch_empty_result(self): + """Fetch empty result.""" + client = PaddleClient(token="test-token") + + mock_response = MagicMock() + mock_response.text = "" + mock_response.raise_for_status = MagicMock() + + with patch("requests.get", return_value=mock_response): + markdown = client._fetch_markdown("https://result.url/data.jsonl") + + assert markdown == "" + + +class TestPaddleClientOcr: + """Full OCR workflow tests.""" + + def test_ocr_workflow(self): + """Complete OCR workflow: submit → poll → fetch.""" + client = PaddleClient(token="test-token") + + # Mock submit + submit_resp = MagicMock() + submit_resp.status_code = 200 + submit_resp.json.return_value = {"data": {"jobId": "job-789"}} + + # Mock poll + poll_resp = MagicMock() + poll_resp.status_code = 200 + poll_resp.json.return_value = { + "data": { + "state": "done", + "resultUrl": {"jsonUrl": "https://result.url/data.jsonl"}, + } + } + + # Mock fetch + jsonl_content = json.dumps({ + "result": { + "layoutParsingResults": [ + {"markdown": {"text": "# OCR Result\n\nExtracted text."}} + ] + } + }) + fetch_resp = MagicMock() + fetch_resp.text = jsonl_content + fetch_resp.raise_for_status = MagicMock() + + with patch("requests.post", return_value=submit_resp), \ + patch("requests.get", side_effect=[poll_resp, fetch_resp]): + markdown = client.ocr(file_bytes=b"fake-image", filename="test.png") + + assert "# OCR Result" in markdown + assert "Extracted text." 
in markdown diff --git a/packages/markitdown-paddleocr/tests/test_scan_detection.py b/packages/markitdown-paddleocr/tests/test_scan_detection.py new file mode 100644 index 000000000..116197fe6 --- /dev/null +++ b/packages/markitdown-paddleocr/tests/test_scan_detection.py @@ -0,0 +1,430 @@ +"""Tests for scan detection optimization.""" + +import pytest +from unittest.mock import MagicMock, patch + +from markitdown_paddleocr._config import PaddleOcrConfig, ScanDetectionMode +from markitdown_paddleocr._converter import PaddleOcrConverter + + +class TestScanDetectionMode: + """Scan detection mode configuration tests.""" + + def test_default_mode_is_sampling(self): + """The default mode should be SAMPLING.""" + config = PaddleOcrConfig() + assert config.scan_detection_mode == ScanDetectionMode.SAMPLING + + def test_custom_mode_from_config(self): + """Read a custom mode from the config object.""" + config = PaddleOcrConfig(scan_detection_mode=ScanDetectionMode.FIRST_PAGE_HINT) + converter = PaddleOcrConverter(config=config, token="test_token") + assert converter.scan_detection_mode == ScanDetectionMode.FIRST_PAGE_HINT + + def test_custom_mode_from_constructor(self): + """Pass a custom mode through the constructor.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.PAGE_BY_PAGE, + ) + assert converter.scan_detection_mode == ScanDetectionMode.PAGE_BY_PAGE + + def test_constructor_overrides_config(self): + """Constructor arguments take precedence over the config object.""" + config = PaddleOcrConfig(scan_detection_mode=ScanDetectionMode.FIRST_PAGE_HINT) + converter = PaddleOcrConverter( + config=config, + token="test_token", + scan_detection_mode=ScanDetectionMode.PAGE_BY_PAGE, + ) + assert converter.scan_detection_mode == ScanDetectionMode.PAGE_BY_PAGE + + +class TestIsScannedPage: + """Scanned page detection tests.""" + + def test_page_without_images_not_scanned(self): + """A page without images is not a scan.""" + converter = PaddleOcrConverter(token="test_token") + + page = MagicMock() + page.images = [] + page.extract_text.return_value = "Some text content here" + + assert converter._is_scanned_page(page) is False + + def
test_page_with_images_and_text_not_scanned(self): + """A page with images but enough text is not a scan.""" + converter = PaddleOcrConverter(token="test_token", scan_text_threshold=50) + + page = MagicMock() + page.images = [MagicMock()] + page.extract_text.return_value = "This is more than 50 characters of text content that should be extracted" + + assert converter._is_scanned_page(page) is False + + def test_page_with_images_no_text_is_scanned(self): + """A page with images and no text is a scan.""" + converter = PaddleOcrConverter(token="test_token", scan_text_threshold=50) + + page = MagicMock() + page.images = [MagicMock()] + page.extract_text.return_value = "" + + assert converter._is_scanned_page(page) is True + + def test_page_with_images_little_text_is_scanned(self): + """A page with images and text below the threshold is a scan.""" + converter = PaddleOcrConverter(token="test_token", scan_text_threshold=50) + + page = MagicMock() + page.images = [MagicMock()] + page.extract_text.return_value = "Short text" # Only 10 chars + + assert converter._is_scanned_page(page) is True + + def test_text_extraction_error_assumes_scanned(self): + """Assume the page is a scan when text extraction fails.""" + converter = PaddleOcrConverter(token="test_token") + + page = MagicMock() + page.images = [MagicMock()] + page.extract_text.side_effect = Exception("Extraction failed") + + assert converter._is_scanned_page(page) is True + + def test_custom_threshold(self): + """A custom threshold takes effect.""" + converter = PaddleOcrConverter(token="test_token", scan_text_threshold=100) + + # Text below threshold + page1 = MagicMock() + page1.images = [MagicMock()] + page1.extract_text.return_value = "Well under 100 characters" # 25 chars + + assert converter._is_scanned_page(page1) is True + + # Text above threshold + page2 = MagicMock() + page2.images = [MagicMock()] + page2.extract_text.return_value = "This is definitely more than 100 characters of text content here for testing and verification purposes" # 102 chars + + assert converter._is_scanned_page(page2) is False + + +class TestDetectAllScanned: + """Whole-document scan detection tests.""" + + def
test_page_by_page_mode_returns_false(self): + """PAGE_BY_PAGE mode always returns False.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.PAGE_BY_PAGE, + ) + + # Even with all scanned pages + pdf = MagicMock() + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + scanned_page.close = MagicMock() + pdf.pages = [scanned_page, scanned_page, scanned_page] + + assert converter._detect_all_scanned(pdf) is False + + def test_first_page_hint_first_page_scanned(self): + """FIRST_PAGE_HINT mode: a scanned first page marks the whole document as scanned.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.FIRST_PAGE_HINT, + ) + + # First page scanned + pdf = MagicMock() + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + scanned_page.close = MagicMock() + + normal_page = MagicMock() + normal_page.images = [] + normal_page.extract_text.return_value = "Normal text" + + pdf.pages = [scanned_page, normal_page, normal_page] + + assert converter._detect_all_scanned(pdf) is True + + def test_first_page_hint_first_page_not_scanned(self): + """FIRST_PAGE_HINT mode: a non-scanned first page does not mark the document as scanned.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.FIRST_PAGE_HINT, + ) + + # First page not scanned + pdf = MagicMock() + normal_page = MagicMock() + normal_page.images = [] + normal_page.extract_text.return_value = "Normal text" + + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + pdf.pages = [normal_page, scanned_page, scanned_page] + + assert converter._detect_all_scanned(pdf) is False + + def test_sampling_mode_majority_scanned(self): + """SAMPLING mode: a scanned majority marks the whole document as scanned.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + # 3 pages, 2 scanned, 1 normal -> majority
scanned + pdf = MagicMock() + + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + normal_page = MagicMock() + normal_page.images = [] + normal_page.extract_text.return_value = "Normal text" + + pdf.pages = [scanned_page, scanned_page, normal_page] + + assert converter._detect_all_scanned(pdf) is True + + def test_sampling_mode_minority_scanned(self): + """SAMPLING mode: a scanned minority does not mark the document as scanned.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + # 3 pages, 1 scanned, 2 normal -> minority scanned + pdf = MagicMock() + + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + normal_page = MagicMock() + normal_page.images = [] + normal_page.extract_text.return_value = "Normal text" + + pdf.pages = [normal_page, normal_page, scanned_page] + + assert converter._detect_all_scanned(pdf) is False + + def test_sampling_mode_all_scanned(self): + """SAMPLING mode: all sampled pages scanned marks the whole document as scanned.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + pdf = MagicMock() + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + pdf.pages = [scanned_page, scanned_page, scanned_page, scanned_page] + + assert converter._detect_all_scanned(pdf) is True + + def test_sampling_mode_custom_sample_count(self): + """SAMPLING mode with a custom sample page count.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=5, + ) + + # 5 pages sampled, 3 scanned -> majority + pdf = MagicMock() + + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + normal_page = MagicMock() + normal_page.images = [] + normal_page.extract_text.return_value = "Normal text" + + pdf.pages =
[scanned_page, scanned_page, scanned_page, normal_page, normal_page] + + assert converter._detect_all_scanned(pdf) is True + + def test_empty_pdf_returns_false(self): + """An empty PDF returns False.""" + converter = PaddleOcrConverter(token="test_token") + + pdf = MagicMock() + pdf.pages = [] + + assert converter._detect_all_scanned(pdf) is False + + def test_pdf_with_less_pages_than_sample_count(self): + """Use the actual page count when the PDF has fewer pages than the sample count.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=5, + ) + + # Only 2 pages, both scanned -> majority + pdf = MagicMock() + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + + pdf.pages = [scanned_page, scanned_page] + + assert converter._detect_all_scanned(pdf) is True + + +class TestConvertPdfWithScanDetection: + """Integration tests for scan detection during PDF conversion.""" + + def test_all_scanned_uses_batch_mode(self): + """An all-scanned document prefers batch upload.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + # Mock _detect_all_scanned to return True + converter._detect_all_scanned = MagicMock(return_value=True) + converter._convert_pdf_batch = MagicMock(return_value="Batch OCR result") + converter._convert_with_paddleocr = MagicMock(return_value="Page OCR result") + + # Mock PDF + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + scanned_page.close = MagicMock() + + pdf = MagicMock() + pdf.pages = [scanned_page, scanned_page] + + with patch("markitdown_paddleocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = pdf + + import io + stream = io.BytesIO(b"%PDF-1.4") + result = converter._convert_pdf(stream) + + # Should call batch mode (1 API call) + converter._convert_pdf_batch.assert_called_once() + # Should NOT call per-page OCR + 
converter._convert_with_paddleocr.assert_not_called() + assert "Batch OCR result" in result.markdown + + def test_batch_failure_fallback_to_per_page(self): + """Fall back to per-page processing when batch OCR fails.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + # Mock _detect_all_scanned to return True + converter._detect_all_scanned = MagicMock(return_value=True) + converter._convert_pdf_batch = MagicMock(side_effect=RuntimeError("Batch API error")) + converter._convert_with_paddleocr = MagicMock(return_value="Page OCR result") + + # Mock PDF + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + scanned_page.close = MagicMock() + + pdf = MagicMock() + pdf.pages = [scanned_page, scanned_page] + + with patch("markitdown_paddleocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = pdf + + import io + stream = io.BytesIO(b"%PDF-1.4") + result = converter._convert_pdf(stream) + + # Should have tried batch first + converter._convert_pdf_batch.assert_called_once() + # Should fall back to per-page OCR + assert converter._convert_with_paddleocr.call_count == 2 + + def test_all_scanned_skips_per_page_analysis(self): + """An all-scanned document skips per-page analysis.""" + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.SAMPLING, + scan_sample_pages=3, + ) + + # Mock _detect_all_scanned to return True + converter._detect_all_scanned = MagicMock(return_value=True) + converter._convert_pdf_batch = MagicMock(return_value="Batch OCR result") + converter._analyze_page = MagicMock(return_value="plain_text") + + # Mock PDF + scanned_page = MagicMock() + scanned_page.images = [MagicMock()] + scanned_page.extract_text.return_value = "" + scanned_page.close = MagicMock() + + pdf = MagicMock() + pdf.pages = [scanned_page, scanned_page] + + with patch("markitdown_paddleocr._converter.pdfplumber.open") as mock_open: + 
mock_open.return_value.__enter__.return_value = pdf + + import io + stream = io.BytesIO(b"%PDF-1.4") + result = converter._convert_pdf(stream) + + # Should call batch mode, not _analyze_page + converter._convert_pdf_batch.assert_called_once() + converter._analyze_page.assert_not_called() + + def test_page_by_page_mode_analyzes_each_page(self): + """PAGE_BY_PAGE mode analyzes every page.""" + + converter = PaddleOcrConverter( + token="test_token", + scan_detection_mode=ScanDetectionMode.PAGE_BY_PAGE, + ) + + # Mock _analyze_page to return different results + converter._analyze_page = MagicMock(side_effect=["plain_text", "complex"]) + converter._convert_with_paddleocr = MagicMock(return_value="OCR result") + converter._extract_text_with_tables = MagicMock(return_value="Text result") + + # Mock PDF + page1 = MagicMock() + page1.close = MagicMock() + page2 = MagicMock() + page2.close = MagicMock() + + pdf = MagicMock() + pdf.pages = [page1, page2] + + with patch("markitdown_paddleocr._converter.pdfplumber.open") as mock_open: + mock_open.return_value.__enter__.return_value = pdf + + import io + stream = io.BytesIO(b"%PDF-1.4") + result = converter._convert_pdf(stream) + + # Should analyze each page + assert converter._analyze_page.call_count == 2 + # Should use different methods for different pages + converter._extract_text_with_tables.assert_called_once() + converter._convert_with_paddleocr.assert_called_once() \ No newline at end of file diff --git a/packages/markitdown/src/markitdown/__main__.py b/packages/markitdown/src/markitdown/__main__.py index 6085ad6bb..934b3df72 100644 --- a/packages/markitdown/src/markitdown/__main__.py +++ b/packages/markitdown/src/markitdown/__main__.py @@ -2,12 +2,14 @@ # # SPDX-License-Identifier: MIT import argparse -import sys import codecs -from textwrap import dedent +import logging +import sys from importlib.metadata import entry_points +from textwrap import dedent + from .__about__ import __version__ -from ._markitdown import MarkItDown, StreamInfo, 
DocumentConverterResult +from ._markitdown import DocumentConverterResult, MarkItDown, StreamInfo def main(): @@ -104,6 +106,14 @@ def main(): help="List installed 3rd-party plugins. Plugins are loaded when using the -p or --use-plugin option.", ) + parser.add_argument( + "--log-level", + type=str, + default="WARNING", + choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"], + help="Set the logging level (default: WARNING). Use INFO or DEBUG to see plugin logs.", + ) + parser.add_argument( "--keep-data-uris", action="store_true", @@ -113,6 +123,13 @@ def main(): parser.add_argument("filename", nargs="?") args = parser.parse_args() + # Configure logging + logging.basicConfig( + level=getattr(logging, args.log_level), + format="%(asctime)s %(levelname)-8s %(name)s: %(message)s", + datefmt="%H:%M:%S", + ) + # Parse the extension hint extension_hint = args.extension if extension_hint is not None: diff --git a/scripts/load_secrets.sh b/scripts/load_secrets.sh new file mode 100755 index 000000000..ede9291d0 --- /dev/null +++ b/scripts/load_secrets.sh @@ -0,0 +1,13 @@ +#!/bin/bash +# Load local secrets + +if [ -f ".secrets.local" ]; then + echo "Loading secrets from .secrets.local" + set -a + source .secrets.local + set +a + echo "✓ Secrets loaded" +else + echo "✗ .secrets.local not found" + exit 1 +fi diff --git a/scripts/pypi-upload.ps1 b/scripts/pypi-upload.ps1 new file mode 100644 index 000000000..a1dbec0b6 --- /dev/null +++ b/scripts/pypi-upload.ps1 @@ -0,0 +1,76 @@ +# Upload markitdown-glmocr and markitdown-paddleocr to PyPI +# Usage: .\scripts\pypi-upload.ps1 [-Version "0.2.0"] +# -Version: optional version number; by default every file in the dist directory is uploaded + +param( + [string]$Version = "" +) + +$ErrorActionPreference = "Stop" + +Write-Host "=== PyPI Upload Script ===" -ForegroundColor Green +Write-Host "" + +# Read PYPI_API_TOKEN from the user environment variables +$PypiToken = [System.Environment]::GetEnvironmentVariable('PYPI_API_TOKEN', 'User') + +if ([string]::IsNullOrEmpty($PypiToken)) { + Write-Host "Error: PYPI_API_TOKEN environment variable not found" 
-ForegroundColor Red + Write-Host "Set PYPI_API_TOKEN in your Windows user environment variables" + exit 1 +} + +Write-Host "✓ PyPI API token loaded" -ForegroundColor Green +Write-Host "" + +# Force UTF-8 encoding +$env:PYTHONUTF8 = "1" + +$ScriptDir = Split-Path -Parent $MyInvocation.MyCommand.Path +$ProjectRoot = Split-Path -Parent $ScriptDir + +$Packages = @("markitdown-glmocr", "markitdown-paddleocr") + +foreach ($Pkg in $Packages) { + $PkgDir = Join-Path $ProjectRoot "packages\$Pkg" + $DistDir = Join-Path $PkgDir "dist" + + if (-not (Test-Path $DistDir)) { + Write-Host "Skipping $Pkg : dist directory does not exist" -ForegroundColor Yellow + continue + } + + Write-Host "--- Uploading $Pkg ---" -ForegroundColor Green + + # Normalize the package name (markitdown-glmocr -> markitdown_glmocr) + $PkgName = $Pkg -replace '-', '_' + + # Select the files to upload + if ($Version) { + $Pattern = "$PkgName-$Version*" + } else { + $Pattern = "$PkgName*" + } + + $UploadFiles = Get-ChildItem -Path $DistDir -Filter $Pattern -ErrorAction SilentlyContinue + + if ($UploadFiles) { + Write-Host "Files:" + $UploadFiles | ForEach-Object { Write-Host " $($_.Name)" } + Write-Host "" + + $FilesArg = $UploadFiles | ForEach-Object { $_.FullName } + & twine upload --username __token__ --password $PypiToken --disable-progress-bar @FilesArg + if ($LASTEXITCODE -ne 0) { throw "twine upload failed for $Pkg" } + # Extract the version number + $LatestVersion = ($UploadFiles[0].Name | Select-String -Pattern '\d+\.\d+\.\d+').Matches.Value + Write-Host "✓ $Pkg uploaded successfully!" 
-ForegroundColor Green + Write-Host " https://pypi.org/project/$Pkg/$LatestVersion/" -ForegroundColor Cyan + Write-Host "" + } else { + Write-Host "Skipping $Pkg : no build files found for version $Version" -ForegroundColor Yellow + Write-Host "" + } +} + +Write-Host "=== Upload complete ===" -ForegroundColor Green diff --git a/scripts/pypi-upload.sh b/scripts/pypi-upload.sh new file mode 100644 index 000000000..dcd3ca6e6 --- /dev/null +++ b/scripts/pypi-upload.sh @@ -0,0 +1,79 @@ +#!/bin/bash +# Upload markitdown-glmocr and markitdown-paddleocr to PyPI +# Usage: ./scripts/pypi-upload.sh [version] +# version: optional version number; by default every file in the dist directory is uploaded + +set -e + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +PROJECT_ROOT="$(dirname "$SCRIPT_DIR")" + +# Colored output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +NC='\033[0m' # No Color + +echo -e "${GREEN}=== PyPI Upload Script ===${NC}" +echo "" + +# Read PYPI_API_TOKEN from the Windows user environment variables +if [ -z "$PYPI_API_TOKEN" ]; then + PYPI_API_TOKEN=$(powershell -Command "[System.Environment]::GetEnvironmentVariable('PYPI_API_TOKEN', 'User')" 2>/dev/null | tr -d '\r') +fi + +if [ -z "$PYPI_API_TOKEN" ] || [ "$PYPI_API_TOKEN" = "(no output)" ]; then + echo -e "${RED}Error: PYPI_API_TOKEN environment variable not found${NC}" + echo "Set the PYPI_API_TOKEN environment variable, or configure it in your Windows user environment variables" + exit 1 +fi + +echo -e "${GREEN}✓ PyPI API token loaded${NC}" +echo "" + +# Force UTF-8 encoding to avoid Windows GBK issues +export PYTHONUTF8=1 + +VERSION="${1:-}" +PACKAGES=("markitdown-glmocr" "markitdown-paddleocr") + +for PKG in "${PACKAGES[@]}"; do + PKG_DIR="$PROJECT_ROOT/packages/$PKG" + + if [ ! 
-d "$PKG_DIR/dist" ]; then + echo -e "${YELLOW}Skipping $PKG: dist directory does not exist${NC}" + continue + fi + + echo -e "${GREEN}--- Uploading $PKG ---${NC}" + + # Normalize the package name (markitdown-glmocr -> markitdown_glmocr) + PKG_NAME=$(echo "$PKG" | tr '-' '_') + + # Select the files to upload + if [ -n "$VERSION" ]; then + UPLOAD_FILES="$PKG_DIR/dist/${PKG_NAME}-${VERSION}*" + else + UPLOAD_FILES="$PKG_DIR/dist/${PKG_NAME}*" + fi + + # Check that the files exist + if ls $UPLOAD_FILES 1> /dev/null 2>&1; then + echo "Files:" + ls $UPLOAD_FILES + echo "" + + twine upload --username __token__ --password "$PYPI_API_TOKEN" --disable-progress-bar $UPLOAD_FILES + + # Extract the version number from the first file name + LATEST_VERSION=$(ls $UPLOAD_FILES | head -1 | grep -oP '\d+\.\d+\.\d+' | head -1) + echo -e "${GREEN}✓ $PKG uploaded successfully!${NC}" + echo " https://pypi.org/project/$PKG/${LATEST_VERSION:-latest}/" + echo "" + else + echo -e "${YELLOW}Skipping $PKG: no build files found for version ${VERSION:-any}${NC}" + echo "" + fi +done + +echo -e "${GREEN}=== Upload complete ===${NC}"
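
Review note: both upload scripts pick build artifacts the same way (replace `-` with `_` in the package name, glob the `dist` directory with an optional version prefix, then take the first `X.Y.Z` match in a file name). The sketch below restates that selection logic in Python so it can be checked in isolation. The function names `dist_name`, `select_dist_files`, and `first_version` are hypothetical and are not part of this PR; the behavior is only an approximation of what the scripts do.

```python
from __future__ import annotations

import re
from pathlib import Path


def dist_name(project: str) -> str:
    """Normalize a project name as the scripts do: '-' becomes '_'."""
    return project.replace("-", "_")


def select_dist_files(dist_dir: Path, project: str, version: str = "") -> list[Path]:
    """Return the build files for a project, optionally limited to a version prefix."""
    prefix = dist_name(project)
    # With a version the pattern is '<name>-<version>*', otherwise '<name>*'.
    pattern = f"{prefix}-{version}*" if version else f"{prefix}*"
    return sorted(dist_dir.glob(pattern))


def first_version(files: list[Path]) -> str | None:
    """Return the first X.Y.Z version found in the file names, if any."""
    for path in files:
        match = re.search(r"\d+\.\d+\.\d+", path.name)
        if match:
            return match.group(0)
    return None
```

Because the version is matched as a prefix, a value such as `0.1` would also select `0.10.0` artifacts; passing the full version string avoids that.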