Fix incorrect Content-Length for StringIO with multi-byte characters #7201
veeceey wants to merge 1 commit into
Just wanted to follow up and see if this is good to go or needs more work.
StringIO.tell() returns the character position, not the byte offset, so super_len() returned the wrong value for StringIO objects containing multi-byte UTF-8 characters (e.g. emoji). This caused an incorrect Content-Length header that violates RFC 9110 section 8.6.
Read the remaining text and encode it to UTF-8 to measure the true byte length, consistent with how plain str bodies are already handled.
Closes psf#6917
20d9eef to 0406663
I reproduced the issue and the fix locally.
On current main, the reporter's core case still gives StringIO super_len=1 Content-Length=1 while plain str gives 4/4. On this branch the same StringIO case gives super_len=4 Content-Length=4, and the cursor-preservation/partially-read cases pass for me:
TOX_WORK_DIR=.codex-tmp/tox/requests-7201 tox -e py312-default -- tests/test_utils.py -q -k super_len
11 passed, 208 deselected
One small test gap: since #6917 is user-visible through the prepared request header, I think it would be worth adding a direct PreparedRequest().prepare(..., data=io.StringIO(...)) assertion for Content-Length == "4" as well. The new super_len coverage is useful, but a header-level assertion would lock the actual behavior that regressed and make this less dependent on the prepare_content_length -> super_len path staying obvious.
Summary
Fixes #6917.
super_len() uses seek/tell to measure the length of file-like objects such as StringIO and BytesIO. However, StringIO.tell() returns the character position, not the byte offset. For strings containing multi-byte UTF-8 characters (e.g. emoji), this produces an incorrect Content-Length header that violates RFC 9110 section 8.6.
For example, io.StringIO("\U0001F4A9") (a single emoji) previously returned a length of 1 (character count) instead of 4 (UTF-8 byte count), causing the server to receive a Content-Length: 1 header while 4 bytes are actually sent.
This is the same class of bug that was fixed for plain str bodies in #6586: str is encoded to UTF-8 before measuring, but StringIO was not. This PR makes StringIO handling consistent with str by reading the remaining text, encoding it to UTF-8, and measuring the byte length.
Before
After
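The character-count versus byte-count mismatch described in the summary can be reproduced with the standard library alone. The following is a minimal sketch, not the PR's own snippet: it assumes CPython, where io.StringIO.tell() reports a character index, and it mimics a seek/tell measurement by hand rather than calling requests. The variable names are illustrative only.

```python
import io

text = "\U0001F4A9"  # one emoji: 1 character, 4 bytes when encoded as UTF-8
buf = io.StringIO(text)

# Measure the way a generic seek/tell approach would: jump to the end
# and ask for the position. For StringIO this counts characters.
buf.seek(0, io.SEEK_END)
tell_length = buf.tell()
buf.seek(0)  # put the cursor back at the start

# Measure what would actually be sent, assuming the body is UTF-8 encoded.
byte_length = len(buf.getvalue().encode("utf-8"))

print(f"tell-based length: {tell_length}, UTF-8 byte length: {byte_length}")
```

The two numbers differ for any text containing characters outside ASCII, which is why a Content-Length derived from tell() undercounts the body.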
Changes
src/requests/utils.py: In super_len(), detect io.StringIO and read+encode the remaining text to compute the UTF-8 byte length instead of relying on tell().
tests/test_utils.py: Added test_super_len_stringio_multibyte covering single emoji, mixed content, partially-read StringIO, and position preservation.
Test plan
TestSuperLen tests pass (ASCII StringIO, BytesIO, partially-read files, etc.)
super_len() call
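The approach listed under Changes (read the remaining text, encode it, restore the cursor) can be sketched as a standalone function. This is a hedged illustration under stated assumptions, not the code in src/requests/utils.py: stringio_utf8_len is a hypothetical name, and the checks below only loosely mirror the four cases the new test is said to cover.

```python
import io


def stringio_utf8_len(stream: io.StringIO) -> int:
    """Hypothetical helper: UTF-8 byte length of the unread part of stream."""
    start = stream.tell()  # remember where the caller left the cursor
    try:
        remaining = stream.read()  # text from the cursor to the end
    finally:
        stream.seek(start)  # restore the cursor even if read() fails
    return len(remaining.encode("utf-8"))


# Single emoji: 1 character, 4 bytes.
single = stringio_utf8_len(io.StringIO("\U0001F4A9"))

# Mixed content: "hi " is 3 bytes, the emoji is 4 bytes.
mixed = stringio_utf8_len(io.StringIO("hi \U0001F4A9"))

# Partially read: after consuming "a", only the emoji and "b" remain (5 bytes).
partial_stream = io.StringIO("a\U0001F4A9b")
partial_stream.read(1)
partial = stringio_utf8_len(partial_stream)

# Position preservation: the cursor is still just past "a".
position_after = partial_stream.tell()

print(single, mixed, partial, position_after)
```

Reading the remaining text costs memory proportional to the body size, which seems an acceptable trade-off here because a StringIO already holds its whole contents in memory.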