Emit lone surrogates as \uXXXX escapes in unicode strings by mariopenterman · Pull Request #615 · zrax/pycdc

mariopenterman · 2026-06-22T22:11:40Z

A lone surrogate (U+D800..U+DFFF) in a unicode string is currently written to the output raw, producing a .py file that cannot be decoded as UTF-8.

Background

A lone surrogate is marshalled as the 3-byte WTF-8/CESU-8 sequence 0xED 0xA0-0xBF 0x80-0xBF, which is invalid UTF-8. Writing it raw yields output that cannot be re-read.
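As a quick illustration of the byte pattern described above, the following Python snippet encodes a lone surrogate with the standard `surrogatepass` error handler and then shows that a strict UTF-8 decoder rejects the result. It is only a sketch of the encoding issue and is not taken from the PR itself.

```python
# A lone high surrogate, U+D800.
lone = "\ud800"

# 'surrogatepass' lets the encoder write the surrogate as three bytes
# instead of raising UnicodeEncodeError.
raw = lone.encode("utf-8", "surrogatepass")
print(raw.hex(" "))  # ed a0 80

# A strict UTF-8 decoder rejects those same bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)  # False
```

The upper end of the range behaves the same way: U+DFFF encodes to `ed bf bf`, which matches the 0xED 0xA0-0xBF 0x80-0xBF pattern.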

After this PR

The string renderer detects that sequence and emits the code point as a \uXXXX escape, which Python parses back to the identical surrogate:

s = '\ud800'

Well-formed multi-byte UTF-8 (accents, astral characters such as 😀) is still passed through unchanged.
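The detection and pass-through behaviour described above can be sketched in Python. This is an illustrative model only: the actual change lives in the C++ file pyc_string.cpp, which is not shown on this page, and the function name `render_unicode_body` is hypothetical. The sketch assumes the input is otherwise well-formed UTF-8.

```python
import ast


def render_unicode_body(data: bytes) -> str:
    """Hypothetical helper: turn marshalled string bytes into literal text.

    Lone surrogates (0xED 0xA0-0xBF 0x80-0xBF) become \\uXXXX escapes;
    every other byte is passed through and decoded as ordinary UTF-8.
    """
    out = []
    pending = bytearray()  # bytes waiting to be decoded as normal UTF-8
    i = 0
    while i < len(data):
        if (data[i] == 0xED and i + 2 < len(data)
                and 0xA0 <= data[i + 1] <= 0xBF
                and 0x80 <= data[i + 2] <= 0xBF):
            # Flush the well-formed bytes collected so far.
            out.append(pending.decode("utf-8"))
            pending.clear()
            # Rebuild the code point from the 3-byte sequence.
            cp = (((data[i] & 0x0F) << 12)
                  | ((data[i + 1] & 0x3F) << 6)
                  | (data[i + 2] & 0x3F))
            out.append("\\u%04x" % cp)
            i += 3
        else:
            pending.append(data[i])
            i += 1
    out.append(pending.decode("utf-8"))
    return "".join(out)


# Usage: a string mixing ASCII, a lone surrogate, an accent and an astral character.
original = "a\ud800é😀"
raw = original.encode("utf-8", "surrogatepass")
body = render_unicode_body(raw)

# The rendered body is now valid UTF-8 and parses back to the same string.
body.encode("utf-8")
assert ast.literal_eval("'" + body + "'") == original
```

Note that 0xED followed by 0x80-0x9F is an ordinary lead byte for U+D000..U+D7FF, so the check on the second byte is what separates surrogates from valid characters. A real renderer would also need to escape quotes, backslashes and control characters, which this sketch leaves out.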

The change is confined to pyc_string.cpp. A self-contained test is included; the full suite stays green.

A lone surrogate (U+D800..U+DFFF) in a unicode string is marshalled as the 3-byte WTF-8/CESU-8 sequence 0xED 0xA0-0xBF 0x80-0xBF, which is invalid UTF-8. Writing it raw produces an undecodable .py.

Detect that sequence in the string renderer and emit the code point as a \uXXXX escape, which Python parses back to the identical surrogate. Well-formed multi-byte UTF-8 (accents, astral characters) is still passed through unchanged.

Signed-off-by: Mario Penterman <mariopenterman@gmail.com>

mariopenterman force-pushed the pr/lone-surrogate-escape branch from 021dc80 to a18a130 Compare June 22, 2026 22:13

mariopenterman marked this pull request as ready for review June 22, 2026 22:17

mariopenterman wants to merge 1 commit into zrax:master from mariopenterman:pr/lone-surrogate-escape
