rustdoc percent-encodes ~ in URLs, breaks links #97125

Closed
lilyball opened this issue May 17, 2022 · 2 comments · Fixed by #99771
Labels
C-bug Category: This is a bug. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.

Comments

@lilyball
Contributor

~

When generating documentation with rustdoc, it appears to percent-encode ~ in link destinations. For an example, see moka 0.8.3. At the bottom of the crate documentation is a link with the title "hierarchical timer wheel". The href in the HTML is http://www.cs.columbia.edu/%7Enahum/w6998/papers/ton97-timing-wheels.pdf, note the %7E, whereas the source is

//! [timer-wheel]: http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf

RFC 1738 declared ~ to be an "unsafe" character, but that was obsoleted 17 years ago by RFC 3986 which explicitly lists ~ as an unreserved character and says that unreserved characters should not be percent-encoded.
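
To make the RFC 3986 rule concrete, here is a minimal Rust sketch of the unreserved set as defined in RFC 3986 §2.3 (ALPHA, DIGIT, `-`, `.`, `_`, `~`). The function name `is_unreserved` is made up for illustration; it is not part of rustdoc or pulldown-cmark.

```rust
/// Returns true if `byte` belongs to the RFC 3986 "unreserved" set:
/// ALPHA / DIGIT / "-" / "." / "_" / "~".
/// Hypothetical helper for illustration only.
fn is_unreserved(byte: u8) -> bool {
    byte.is_ascii_alphanumeric() || matches!(byte, b'-' | b'.' | b'_' | b'~')
}
```

Under this definition `is_unreserved(b'~')` is true, so an encoder following RFC 3986 has no reason to turn `~` into `%7E`.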

The fact that rustdoc encodes this is a problem because it actually breaks links. Case in point, the link from the motivating example here is broken by the percent-encoding. It shouldn't be, but not all servers percent-decode paths before interpreting them. If you click on http://www.cs.columbia.edu/%7Enahum/w6998/papers/ton97-timing-wheels.pdf you get a 404, but if you click on the originally-specified http://www.cs.columbia.edu/~nahum/w6998/papers/ton97-timing-wheels.pdf it works.
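
For reference, this is roughly the decoding step that the two URLs above rely on to be treated as equivalent. It is a simplified sketch, not any particular server's implementation; `percent_decode` is a hypothetical name, and malformed escapes are simply copied through.

```rust
/// Decodes `%XX` escapes in `input`.
/// Malformed or truncated escapes are copied through unchanged.
/// Hypothetical helper for illustration only.
fn percent_decode(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // A valid escape needs '%' followed by two hex digits.
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hi = (bytes[i + 1] as char).to_digit(16);
            let lo = (bytes[i + 2] as char).to_digit(16);
            if let (Some(hi), Some(lo)) = (hi, lo) {
                out.push((hi * 16 + lo) as u8);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    // Decoded bytes may not be valid UTF-8; replace invalid sequences.
    String::from_utf8_lossy(&out).into_owned()
}
```

A server that runs something like this before resolving the path sees `/%7Enahum/` and `/~nahum/` as the same path. A server that skips it sees two different paths, which is consistent with the 404 described above.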

^ and other characters

I've also noticed that rustdoc percent-encodes ^, which is annoying when trying to use a link like https://docs.rs/parking_lot/^0.12/parking_lot/type.Mutex.html as it ends up looking ugly. RFC 3986 disallows ^ inside URLs, but the HTML5 spec extends the URL syntax to add ^ to the set of unreserved characters (along with other characters that RFC 3986 omitted). As such, rustdoc should target HTML5's notion of what constitutes a valid URL rather than RFC 3986's definition, as the URLs it produces will be parsed according to the HTML spec.

More generally, rustdoc should attempt to preserve the URL as it was written to the extent possible. This may in fact mean not adding any percent-encoding at all, as the URL is written directly in the markdown and RFC 3986 §2.4 specifies that under normal circumstances, URL-encoding should only be done when producing a URL from its component parts. As rustdoc is not producing a URL from component parts it should probably just leave the URL alone.

Meta

This occurs both in rust 1.60.0 and in the unstable compiler used by docs.rs (currently 1.63.0-nightly (c52b9c10b 2022-05-16)).

@lilyball lilyball added the C-bug Category: This is a bug. label May 17, 2022
@JohnTitor JohnTitor added the T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. label May 17, 2022
@Urgau
Member

Urgau commented May 18, 2022

rustdoc uses pulldown-cmark for its Markdown parsing and HTML writing.

As I understand it, the library uses a very simple, somewhat naive algorithm to decide which characters to encode, based on this table: https://github.com/raphlinus/pulldown-cmark/blob/9bfba94ca849c7d9d75b53ba1f505761954e6290/src/escape.rs#L29-L38, where 1 represents true and the table follows ASCII order.
We can clearly see on line 7, entry 14 (both counted from zero; ~ = 0x7E = 126, and 126/16 = 7.875 = 7 + 14/16) that the entry is 1 instead of 0. The fix is therefore simply to set it to 0.
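
The index arithmetic can be sketched in Rust as follows. This only illustrates how a byte maps to a row and entry in a 128-entry table printed 16 per line. The table contents here are placeholders rather than pulldown-cmark's actual data, and the names `row_and_entry`, `table_entry`, and `demo_table` are hypothetical.

```rust
/// Maps an ASCII byte to its (row, entry) position in a table
/// that is laid out as 8 rows of 16 entries. Both are zero-based.
fn row_and_entry(byte: u8) -> (usize, usize) {
    ((byte / 16) as usize, (byte % 16) as usize)
}

/// Looks up the flag stored for `byte`.
/// Returns None for non-ASCII bytes, which fall outside the table.
fn table_entry(table: &[u8; 128], byte: u8) -> Option<u8> {
    table.get(byte as usize).copied()
}

/// Builds a placeholder table where only the `~` entry is set to `flag`.
/// This is NOT the real table from the library.
fn demo_table(flag: u8) -> [u8; 128] {
    let mut table = [0u8; 128];
    table[b'~' as usize] = flag;
    table
}
```

Here `row_and_entry(b'~')` gives `(7, 14)`, matching the position worked out above, and changing that single entry is enough to change how `~` is treated.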

@GuillaumeGomez
Member

We just need to wait for pulldown-cmark to publish a new version, and it should be fixed.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jul 26, 2022
…k, r=Urgau

Update pulldown-cmark version to 0.9.2 (fixes url encoding for some chars)

Fixes rust-lang#97125.

r? `@Dylan-DPC`
@bors bors closed this as completed in f8f07de Jul 27, 2022