-
-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
urllib3.util.parse_url() handling of URLs that contain Unicode in hostname #799
Comments
That would be a lot of work though. Do you have the time? |
I mean, it's only actually used in 4 places… The only major difference I foresee here is that the "bad URL" check would be much less likely to fail because of a parsing error (which as far as I can tell only happens in …I've actually discovered this while attempting to add netloc to the list of supported sort fields (and subsequently trying to figure out why netloc is obtained differently in different source files). I don't believe that sorting by punycode-encoded netloc would produce the same results as when using the non-encoded one (and displaying such an encoded netloc to the user – in UI or console output – would certainly be treated as a bug) 😅 The alternative would be to use both functions in the same source file (one to obtain netloc and otherwise the other) |
I would rather have 1 version only. Keep the changes as minimal as possible. |
test_bukuDb.py
includes a test bookmark for URL'http://www.zażółćgęśląjaźń.pl/'
. To my understanding, this means such URLs are intended to be supported by Buku.However, when parsing it with
urllib3.util.parse_url()
, the.netloc
of produced URL is encoded with punycode:'www.xn--zaglja-cxa0mpa5p6q5a80a6ota.pl'
. This might be considered to be "correct" behaviour in the technical sense (since RFC limits allowed hostname chars to ASCII), but not for the purposes of "extracting the relevant substring", which is what Buku uses it for. (To compare, Bukuserver usesurllib.parse.urlparse()
which produces the expected value'www.zażółćgęśląjaźń.pl'
.)Additionally, using
parse_url()
to parse URLs with Unicode hostnames implicitly requiresidna
library to be installed (urllib3
v2+ includes this information in the cause exception, while v1.26.x just raises a generic "failed to parse" error when invoked with such a URL).Because of these reasons, I strongly suggest that URL parsing in Buku is switched from using
urllib3.util.parse_url()
back tourllib.parse.urlparse()
.The text was updated successfully, but these errors were encountered: