Linkify punycode-encodes em-dash #28

alexwennerberg · 2021-12-26T16:24:32Z

Hi! Thanks for this library -- I use it in my new mailing list software to detect links in emails. Someone brought what appears to be a bug to my attention: https://lists.flounder.online/crabmail/threads/1beaffd2384b.html

Here's my code:
https://git.alexwennerberg.com/crabmail/file/src/utils.rs.html#l22

I think that this could unambiguously be parsed, but I'm not 100% sure. What do you think?

robinst · 2022-02-11T05:47:51Z

> Someone brought what appears to be a bug to my attention: https://lists.flounder.online/crabmail/threads/1beaffd2384b.html

Hmm, that link doesn't load for me. Could you provide a copy here?

alexwennerberg · 2022-02-11T05:52:49Z

Ah, sorry -- I shuffled things around a bit on my site. Here's the fixed link:

https://lists.flounder.online/crabmail/threads/BD094616-4ACC-47F9-BE79-6C61A66A76D7@paritybit.ca.html

robinst · 2022-02-11T06:03:14Z

I see. That's an interesting case, because — can currently be part of a URL, like this:

https://www.example.com/—

In that case, the whole text, including the em-dash, would get linked.

Also note that GitHub behaves the same way here:

https://www.example.com—
https://www.example.com/—

We could fix the case where it's part of the domain; see also #29, which has some discussion around that. But what would you expect in the case where it's part of the path?

alexwennerberg · 2022-02-11T06:22:26Z

I think that if it's part of the path, it should be treated as such. I guess this is a broader question: should this library reject invalid TLDs? For example:

https://lists.flounder.online/test/threads/YgYAU45J+dZURu1F@localhost.lan.html

I think that the tradeoffs you've made with the library as written are reasonable, though.

Came up in a couple of places: #41, #29, #38, #28. Hopefully we can fix all of these with these changes. Not done yet, still want to have domain checking for URLs with certain schemes (https) but allow everything for others. If we do that, we may be able to unify the email and plain domain parsing with the scheme one too.

robinst · 2022-07-11T05:55:37Z

I've reworked domain parsing in 0.9.0 (see https://github.com/robinst/linkify/blob/main/CHANGELOG.md#090---2022-07-11), but I haven't addressed this yet.

I think we could now do this by rejecting TLDs that contain non-alphanumeric Unicode characters. Note that some TLDs contain non-ASCII characters, but those characters are still alphanumeric; see the examples here: https://en.wikipedia.org/wiki/Internationalized_country_code_top-level_domain
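
To make the proposed check concrete, here is a minimal sketch using only the Rust standard library. It is not linkify's implementation, and `tld_is_alphanumeric` is a hypothetical helper name that does not exist in the crate. It takes the last dot-separated label of a candidate domain and accepts it only if every character is alphanumeric in the Unicode sense.

```rust
/// Hypothetical helper (not part of linkify's API): returns true if the last
/// dot-separated label of `domain` is non-empty and consists only of
/// Unicode alphanumeric characters.
fn tld_is_alphanumeric(domain: &str) -> bool {
    match domain.rsplit('.').next() {
        Some(tld) if !tld.is_empty() => tld.chars().all(char::is_alphanumeric),
        _ => false,
    }
}

fn main() {
    // A plain ASCII TLD passes.
    assert!(tld_is_alphanumeric("www.example.com"));
    // A trailing em-dash (U+2014) is punctuation, so the TLD is rejected.
    assert!(!tld_is_alphanumeric("www.example.com\u{2014}"));
    // An internationalized TLD written in its Unicode form still passes.
    assert!(tld_is_alphanumeric("пример.рф"));
    // A trailing dot leaves an empty last label, which is rejected here.
    assert!(!tld_is_alphanumeric("example.com."));
}
```

One caveat with this sketch: the ASCII (punycode) form of an internationalized TLD, such as `xn--p1ai`, contains hyphens and would be rejected, so a real implementation would probably need to allow hyphens inside the label or handle the `xn--` prefix separately.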
