Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

urllib3.util.parse_url() handling of URLs that contain Unicode in hostname #799

Closed
LeXofLeviafan opened this issue Dec 1, 2024 · 3 comments · Fixed by #800
Closed

urllib3.util.parse_url() handling of URLs that contain Unicode in hostname #799

LeXofLeviafan opened this issue Dec 1, 2024 · 3 comments · Fixed by #800
Labels

Comments

@LeXofLeviafan
Copy link
Collaborator

LeXofLeviafan commented Dec 1, 2024

test_bukuDb.py includes a test bookmark for URL 'http://www.zażółćgęśląjaźń.pl/'. To my understanding, this means such URLs are intended to be supported by Buku.

However, when parsing it with urllib3.util.parse_url(), the .netloc of produced URL is encoded with punycode: 'www.xn--zaglja-cxa0mpa5p6q5a80a6ota.pl'. This might be considered to be "correct" behaviour in the technical sense (since RFC limits allowed hostname chars to ASCII), but not for the purposes of "extracting the relevant substring", which is what Buku uses it for. (To compare, Bukuserver uses urllib.parse.urlparse() which produces the expected value 'www.zażółćgęśląjaźń.pl'.)

Additionally, using parse_url() to parse URLs with Unicode hostnames implicitly requires idna library to be installed (urllib3 v2+ includes this information in the cause exception, while v1.26.x just raises a generic "failed to parse" error when invoked with such a URL).

Because of these reasons, I strongly suggest that URL parsing in Buku is switched from using urllib3.util.parse_url() back to urllib.parse.urlparse().

@jarun
Copy link
Owner

jarun commented Dec 2, 2024

That would be a lot of work though. Do you have the time?
And it may also invite unforeseen issues.

@LeXofLeviafan
Copy link
Collaborator Author

LeXofLeviafan commented Dec 2, 2024

I mean, it's only actually used in 4 places…

The only major difference I foresee here is that the "bad URL" check would be much less likely to fail because of a parsing error (which as far as I can tell only happens in urllib when parsing IPv6 URLs). And of course it wouldn't be raising an exception defined by urllib3.

…I've actually discovered this while attempting to add netloc to the list of supported sort fields (and subsequently trying to figure out why netloc is obtained differently in different source files). I don't believe that sorting by punycode-encoded netloc would produce the same results as when using the non-encoded one (and displaying such an encoded netloc to the user – in UI or console output – would certainly be treated as a bug) 😅

The alternative would be to use both functions in the same source file (one to obtain netloc and otherwise the other)

@jarun
Copy link
Owner

jarun commented Dec 3, 2024

I would rather have 1 version only. Keep the changes as minimal as possible.

LeXofLeviafan added a commit to LeXofLeviafan/buku that referenced this issue Dec 8, 2024
@jarun jarun closed this as completed in #800 Dec 8, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jan 12, 2025
# for free to subscribe to this conversation on GitHub. Already have an account? #.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants