UTF-8 decoding of headers (revise #207) #3279
Comments
GitMate.io thinks possibly related issues are #3270 (UnicodeDecodeError: 'utf-8' codec can't decode byte ...), #1750 (Encoding is always UTF-8 in POST data), #1652 (Trailer headers), #1731 (UnicodeEncodeError: 'utf-8' codec can't encode character '\udca9'), and #18 (Auto-decoding doesn't recognize content-type: application/json; charset=utf-8).
It is a deliberate choice. Good guys should use ASCII only :) If you still need to receive and process non-UTF-8 strings -- use raw_headers with explicit decoding by any desired codec.
I couldn't agree more regarding good guys. But what should we do with the rest? :)
Almost the same. Small pros and cons on both sides. You have chosen the 2nd one. Well, I see your point and can accept your decision. Thank you.
N.B. According to my research last summer (cherrypy/cheroot#27 (comment)), none of the mainstream HTTP clients (browsers) actually try to decode unicode headers.
If it matters, WSGI follows the standard way.
That ship sailed in 2014.
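The raw_headers workaround suggested in the comments can be sketched as follows. This is only a minimal illustration: decode_raw_headers is a hypothetical helper, not part of aiohttp, and it assumes raw_headers is a sequence of (bytes, bytes) pairs and that the sender used Latin1.

```python
def decode_raw_headers(raw_headers, encoding="latin-1"):
    """Decode (name, value) byte pairs with an explicitly chosen codec.

    Hypothetical helper: header names are decoded as ASCII because field
    names are plain tokens; values use the codec the caller asks for.
    """
    return [
        (name.decode("ascii"), value.decode(encoding))
        for name, value in raw_headers
    ]


# Sample data shaped like the raw header pairs described in this issue.
sample = ((b"User-Agent", b"Versi\xf3n"),)
decoded_headers = decode_raw_headers(sample)
```

In a request handler, the same function would be called with request.raw_headers instead of the sample tuple.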
Long story short
Header fields are decoded using UTF-8 with the surrogateescape error handler. It generates surrogates when the headers are Latin1 (ISO-8859-1) encoded. The resulting string needs extra care: you can't encode it without the same surrogateescape error handler, but surprisingly it is serializable using JSON, which causes headaches later.
Expected behaviour
According to RFC7230:
Header fields should be decoded using ISO-8859-1 without the surrogateescape error handler.
Actual behaviour
The current behaviour was introduced back in 2014 (see the old ticket #207) to resolve an issue with a UTF-8 encoded header. As the author mentioned, it was against the spec, and he was right. It doesn't cause any issue with ASCII or UTF-8 headers, but it breaks Latin1 headers, which follow the standard. So the good guys are punished :/
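The decoding behaviour described above can be reproduced with the standard library alone, without aiohttp. The snippet below is a sketch using the same header value as this report; the variable names are illustrative only.

```python
import json

# A Latin1 (ISO-8859-1) encoded header value, as it arrives on the wire.
raw = "Versión".encode("latin-1")  # b'Versi\xf3n'

# UTF-8 + surrogateescape turns the undecodable byte 0xF3 into U+DCF3.
decoded = raw.decode("utf-8", "surrogateescape")

# Encoding the result without the same error handler fails.
try:
    decoded.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_failed = True
else:
    strict_encode_failed = False

# json.dumps (with the default ensure_ascii=True) accepts the lone surrogate.
as_json = json.dumps({"User-Agent": decoded})

# The original bytes can be recovered with the same error handler,
# and then decoded as Latin1 to get the intended text.
recovered = decoded.encode("utf-8", "surrogateescape")
latin1 = recovered.decode("latin-1")
```

The last two lines show one way to repair an already decoded value when the sender is known to have used Latin1.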
Steps to reproduce
Just start up any aiohttp server and send a message where a header contains a non-US-ASCII character. I could not send it properly via curl but managed to do it with requests:
>>> requests.post("http://localhost:8000/", json={"requests": "data"}, headers={"User-Agent": "Versión"})
You will see the value is b"Versi\xf3n" in request.raw_headers. In request.headers the value is "Versi\udcf3n".
Your environment
aiohttp 2.3.10 (server) - not the latest, but as far as I can see this part is still the same
Python 3.5.2
Linux Ubuntu 16.04 LTS