
Handle percent-encoded EPUB manifest hrefs #2036

Closed
xujiantop-crypto wants to merge 1 commit into microsoft:main from xujiantop-crypto:codex/epub-encoded-href

Conversation

@xujiantop-crypto

Summary

EPUB manifest href values are URI references, so filenames containing spaces or other escaped characters can appear as percent-encoded paths such as chapter%201.xhtml.

The current converter compares those raw href values against ZIP member names, so converting a valid EPUB whose archive contains OPS/chapter 1.xhtml can silently drop that chapter's content.
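
To illustrate the mismatch, here is a minimal sketch that builds an in-memory ZIP with a member named OPS/chapter 1.xhtml and checks a raw href against the member list. The archive contents and the naive string concatenation with the OPF directory are assumptions for illustration, not the converter's actual code.

```python
import io
import zipfile

# Build a tiny in-memory archive whose member name contains a literal space.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("OPS/content.opf", "<package/>")
    archive.writestr("OPS/chapter 1.xhtml", "<p>Hello</p>")

with zipfile.ZipFile(buffer) as archive:
    member_names = set(archive.namelist())

# The manifest stores the href as a URI reference, so the space is escaped.
raw_href = "chapter%201.xhtml"

# Assumed naive lookup: prepend the OPF directory without decoding.
raw_candidate = "OPS/" + raw_href

raw_found = raw_candidate in member_names
actual_member_present = "OPS/chapter 1.xhtml" in member_names
```

Here raw_found is False even though the chapter is in the archive, which is the silent omission described above.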

This PR resolves manifest hrefs relative to the OPF file and decodes the URI path before matching ZIP members.

Changes

  • Resolve EPUB spine item hrefs via urljoin relative to the OPF path.
  • Decode percent-encoded URI paths before looking up ZIP members.
  • Add a regression test with href="chapter%201.xhtml" and ZIP member chapter 1.xhtml.
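
The first two changes can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name resolve_manifest_href is hypothetical, and using urlsplit plus unquote for the decoding step is an assumption (the description only states that urljoin is used and that the URI path is decoded).

```python
from urllib.parse import unquote, urljoin, urlsplit


def resolve_manifest_href(opf_path: str, href: str) -> str:
    """Return the ZIP member name a manifest href most likely refers to.

    Hypothetical helper for illustration only.
    """
    # Resolve the href against the OPF location, like any relative URI reference.
    joined = urljoin(opf_path, href)
    # Keep only the path component (drops any ?query or #fragment),
    # then decode %XX escapes so the result can match a ZIP member name.
    return unquote(urlsplit(joined).path)


member = resolve_manifest_href("OPS/content.opf", "chapter%201.xhtml")
parent = resolve_manifest_href("OPS/pkg/content.opf", "../Text/ch%C3%A9.xhtml")
```

With the inputs from the regression test, member is "OPS/chapter 1.xhtml". Splitting off the path before decoding means a fragment such as #section is not mistaken for part of the filename.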

Verification

  • python -m pytest tests/test_module_vectors.py::test_convert_local[test_vector14] tests/test_module_vectors.py::test_convert_stream_with_hints[test_vector14] tests/test_module_misc.py::test_epub_percent_encoded_manifest_href -q
  • python -m black --check src/markitdown/converters/_epub_converter.py tests/test_module_misc.py
  • git diff --check

I also ran broader vector tests locally, but this environment only has core dependencies installed. Vectors requiring optional extras such as docx, xlsx, pdf, and outlook fail with missing optional dependency errors unrelated to this EPUB change.

@xujiantop-crypto xujiantop-crypto force-pushed the codex/epub-encoded-href branch 3 times, most recently from d82279e to 6b98c01 Compare June 1, 2026 09:39
@xujiantop-crypto xujiantop-crypto force-pushed the codex/epub-encoded-href branch from 6b98c01 to ef179d8 Compare June 1, 2026 09:56
@xujiantop-crypto xujiantop-crypto deleted the codex/epub-encoded-href branch June 1, 2026 11:31
