Emit lone surrogates as \uXXXX escapes in unicode strings #615

Open
mariopenterman wants to merge 1 commit into zrax:master from mariopenterman:pr/lone-surrogate-escape

mariopenterman commented Jun 22, 2026

A lone surrogate (U+D800..U+DFFF) in a unicode string is currently written raw, producing a .py file that is not valid UTF-8 and cannot be decoded.

Background

A lone surrogate is marshalled as the 3-byte WTF-8/CESU-8 sequence 0xED 0xA0-0xBF 0x80-0xBF, which is invalid UTF-8. Writing it raw yields output that cannot be re-read.
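
The byte pattern above can be checked from Python itself. The following is a minimal sketch, not part of the PR: it assumes the marshalled form matches what the `utf-8` codec produces with the `surrogatepass` error handler, and `is_strict_utf8` is a hypothetical helper name introduced only for this illustration.

```python
# Sketch only: reproduces the 3-byte form of a lone surrogate and shows that
# a strict UTF-8 decoder rejects it. Assumes the marshalled bytes match what
# the "surrogatepass" error handler produces.

def is_strict_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Lowest and highest surrogate code points, encoded with surrogates allowed.
low_end = "\ud800".encode("utf-8", "surrogatepass")
high_end = "\udfff".encode("utf-8", "surrogatepass")

# Well-formed multi-byte text (an accent and an astral character) for contrast.
well_formed = "é😀".encode("utf-8")
```

Here `low_end` and `high_end` both start with 0xED followed by a second byte in 0xA0-0xBF, and neither passes `is_strict_utf8`, while `well_formed` does.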

After this PR

The string renderer detects that sequence and emits the code point as a \uXXXX escape, which Python parses back to the identical surrogate:

s = '\ud800'

Well-formed multi-byte UTF-8 (accents, astral characters such as 😀) is still passed through unchanged.
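
The detect-and-escape step described above can be sketched in Python as follows. This is an illustration of the idea only, not the C++ code in pyc_string.cpp: `escape_lone_surrogates` is a hypothetical name, and the sketch ignores everything else a real string renderer must do (quote and backslash escaping, prefixes, other malformed input).

```python
import ast


def escape_lone_surrogates(raw: bytes) -> str:
    """Hypothetical sketch: replace each 3-byte surrogate sequence
    (0xED 0xA0-0xBF 0x80-0xBF) with a \\uXXXX escape and pass every
    other byte through untouched."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if (i + 2 < len(raw)
                and raw[i] == 0xED
                and 0xA0 <= raw[i + 1] <= 0xBF
                and 0x80 <= raw[i + 2] <= 0xBF):
            # Rebuild the code point from the standard 3-byte UTF-8 layout.
            cp = ((raw[i] & 0x0F) << 12) | ((raw[i + 1] & 0x3F) << 6) | (raw[i + 2] & 0x3F)
            out += b"\\u%04x" % cp
            i += 3
        else:
            out.append(raw[i])
            i += 1
    # After escaping, the remaining bytes are expected to be valid UTF-8.
    return out.decode("utf-8")


# Round trip: the escaped text, placed in quotes, parses back to the surrogate.
escaped = escape_lone_surrogates(b"a\xed\xa0\x80b")
restored = ast.literal_eval("'" + escaped + "'")

# Well-formed multi-byte UTF-8 is left alone.
untouched = escape_lone_surrogates("é😀".encode("utf-8"))
```

The second-byte range 0xA0-0xBF is what separates surrogates from the valid code points U+D000..U+D7FF, which also start with 0xED but continue with 0x80-0x9F.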

The change is confined to pyc_string.cpp. A self-contained test is included; the full suite stays green.

A lone surrogate (U+D800..U+DFFF) in a unicode string is marshalled as the
3-byte WTF-8/CESU-8 sequence 0xED 0xA0-0xBF 0x80-0xBF, which is invalid UTF-8.
Writing it raw produces an undecodable .py. Detect that sequence in the string
renderer and emit the code point as a \uXXXX escape, which Python parses back
to the identical surrogate. Well-formed multi-byte UTF-8 (accents, astral
characters) is still passed through unchanged.

Signed-off-by: Mario Penterman <mariopenterman@gmail.com>
mariopenterman force-pushed the pr/lone-surrogate-escape branch from 021dc80 to a18a130 on June 22, 2026 22:13
mariopenterman marked this pull request as ready for review on June 22, 2026 22:17