Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Linkify punycode-encodes em-dash #28

Open
alexwennerberg opened this issue Dec 26, 2021 · 5 comments
Open

Linkify punycode-encodes em-dash #28

alexwennerberg opened this issue Dec 26, 2021 · 5 comments

Comments

@alexwennerberg
Copy link

Hi! Thanks for this library -- I use it in my new mailing list software to detect links in emails. Someone brought what appears to be a bug to my attention: https://lists.flounder.online/crabmail/threads/1beaffd2384b.html

Here's my code:
https://git.alexwennerberg.com/crabmail/file/src/utils.rs.html#l22

I think that this could unambiguously be parsed, but I'm not 100% sure. What do you think?

@robinst
Copy link
Owner

robinst commented Feb 11, 2022

Someone brought what appears to be a bug to my attention: https://lists.flounder.online/crabmail/threads/1beaffd2384b.html

Hmm that link doesn't load for me, could you provide a copy here?

@alexwennerberg
Copy link
Author

Ah, sorry -- I shuffled things around a bit on my site. Here's the fixed link:

https://lists.flounder.online/crabmail/threads/BD094616-4ACC-47F9-BE79-6C61A66A76D7@paritybit.ca.html

@robinst
Copy link
Owner

robinst commented Feb 11, 2022

I see. That's an interesting case, because can currently be part of an URL, e.g. like this:

https://www.example.com/—

In that case, the whole text including em-dash would get linked.

Also note that GitHub behaves the same way here:

https://www.example.com—
https://www.example.com/—

We could fix the case where it's part of the domain, see also #29 which has some discussion around that. But what would you expect with the case where it's part of the path?

@alexwennerberg
Copy link
Author

I think that if it's part of the path it should be treated as such. I guess this is a broader question, whether this library should reject invalid TLDs? like:

https://lists.flounder.online/test/threads/YgYAU45J+dZURu1F@localhost.lan.html

I think that the tradeoffs that you've made with the library as written are reasonable though

robinst added a commit that referenced this issue Jun 27, 2022
Came up in a couple of places: #41, #29, #38, #28. Hopefully we can fix
all of these with these changes.

Not done yet, still want to have domain checking for URLs with certain
schemes (https) but allow everything for others.

If we do that, we may be able to unify the email and plain domain
parsing with the scheme one too.
@robinst
Copy link
Owner

robinst commented Jul 11, 2022

I've reworked domain parsing in 0.9.0 (see https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11), but I haven't addressed this yet.

I think we could now do this by rejecting TLDs that contain non-alphanumeric Unicode characters. Note that there are TLDs that contain non-ASCII characters, see examples here (but they would be alphanumeric): https://en.wikipedia.org/wiki/Internationalized_country_code_top-level_domain

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants