Emit lone surrogates as \uXXXX escapes in unicode strings #615
Open
mariopenterman wants to merge 1 commit into
A lone surrogate (U+D800..U+DFFF) in a unicode string is marshalled as the 3-byte WTF-8/CESU-8 sequence 0xED 0xA0-0xBF 0x80-0xBF, which is invalid UTF-8. Writing it raw produces an undecodable .py. Detect that sequence in the string renderer and emit the code point as a \uXXXX escape, which Python parses back to the identical surrogate. Well-formed multi-byte UTF-8 (accents, astral characters) is still passed through unchanged.

Signed-off-by: Mario Penterman <mariopenterman@gmail.com>
021dc80 to a18a130
A lone surrogate (U+D800..U+DFFF) in a unicode string is currently written raw, producing an undecodable .py.

Background
A lone surrogate is marshalled as the 3-byte WTF-8/CESU-8 sequence 0xED 0xA0-0xBF 0x80-0xBF, which is invalid UTF-8. Writing it raw yields output that cannot be re-read.

After this PR
The string renderer detects that sequence and emits the code point as a \uXXXX escape, which Python parses back to the identical surrogate. Well-formed multi-byte UTF-8 (accents, astral characters such as 😀) is still passed through unchanged.
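
For readers who want to see the byte-level rule spelled out, here is a minimal sketch of the detection and escaping step described above. It is not the code from this PR: the function names `isEncodedSurrogate` and `escapeLoneSurrogates` are hypothetical, and the sketch assumes the input is the raw marshalled byte string and that lowercase hex digits are acceptable in the escape.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not taken from the PR): true when the three bytes at
// s[i..i+2] match 0xED 0xA0-0xBF 0x80-0xBF, i.e. an encoded surrogate.
static bool isEncodedSurrogate(const std::string& s, std::size_t i)
{
    if (i + 2 >= s.size())
        return false;
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    const unsigned char b2 = static_cast<unsigned char>(s[i + 2]);
    return b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF && b2 >= 0x80 && b2 <= 0xBF;
}

// Hypothetical helper (not taken from the PR): replaces every encoded
// surrogate with a \uXXXX escape and copies all other bytes unchanged, so
// well-formed multi-byte UTF-8 passes through as-is.
std::string escapeLoneSurrogates(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        if (isEncodedSurrogate(in, i)) {
            // Standard 3-byte decoding: 4 bits + 6 bits + 6 bits.
            const unsigned cp =
                ((static_cast<unsigned char>(in[i]) & 0x0Fu) << 12) |
                ((static_cast<unsigned char>(in[i + 1]) & 0x3Fu) << 6) |
                (static_cast<unsigned char>(in[i + 2]) & 0x3Fu);
            char buf[8];  // backslash, 'u', four hex digits, NUL
            std::snprintf(buf, sizeof(buf), "\\u%04x", cp);
            out += buf;
            i += 3;
        } else {
            out += in[i];
            ++i;
        }
    }
    return out;
}
```

Under these assumptions, the bytes 0xED 0xA0 0x80 become the six characters `\ud800`, while 0xC3 0xA9 (é) and 0xF0 0x9F 0x98 0x80 (😀) are copied through untouched. The sequence 0xED 0x9F 0xBF (U+D7FF) is also left alone, because its second byte falls below 0xA0.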

The change is confined to pyc_string.cpp. A self-contained test is included; the full suite stays green.