Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

fix: Percent-encode Markdown reserved symbols in URLs #26

Merged
merged 3 commits into from
Oct 24, 2021

Conversation

AndrejsK
Copy link
Contributor

While these symbols are generally considered valid in URLs, they are a part of Markdown syntax. This can cause problems in some MD renderers, like in Discord embeds, where URLs containing these symbols sometimes break.

Code example:

import { NodeHtmlMarkdown } from 'node-html-markdown'
import { NodeHtmlMarkdown as nhmFork } from 'node-html-markdown-fork'


const element1 = "<a href='https://lv.wikipedia.org/wiki/C%C4%93su_kauja_(1578)'>TEST</a>";
const element2 = "<i><a href=\"https://lv.wikipedia.org/wiki/Chandrayaan_1\" title=\"Chandrayaan 1\">Chandrayaan 1</a></i>";

console.log("Current master\n");
console.log(NodeHtmlMarkdown.translate(element1));
console.log(NodeHtmlMarkdown.translate(element2));

console.log("\nMy fork\n")
console.log(nhmFork.translate(element1));
console.log(nhmFork.translate(element2));

Output:

Current master

[TEST](https://lv.wikipedia.org/wiki/C%C4%93su_kauja_(1578))
_[Chandrayaan 1](https://lv.wikipedia.org/wiki/Chandrayaan1 "Chandrayaan 1")_

My fork

[TEST](https://lv.wikipedia.org/wiki/C%C4%93su%5Fkauja%5F%281578%29)
_[Chandrayaan 1](https://lv.wikipedia.org/wiki/Chandrayaan%5F1 "Chandrayaan 1")_

Currently the second example is just plain wrong - an underscore between Chandrayaan and 1 has gone missing and the URL is broken. The first example is less obvious, but the brackets at the end confuse Discord on Android, while the encoded version works just fine.

Since I was already making this change, I also chose to encode the * symbol, as it is also significant in Markdown and could cause problems.

Test results:

yarn run v1.22.10
$ jest
 PASS  test/default-tags-codeblock.test.ts
 PASS  test/options.test.ts
 PASS  test/default-tags.test.ts
 PASS  test/special-cases.test.ts

Test Suites: 4 passed, 4 total
Tests:       54 passed, 54 total
Snapshots:   0 total
Time:        2.456 s, estimated 3 s
Ran all test suites.
Done in 3.56s.

Benchmark:

yarn run v1.22.10
$ cd benchmark && yarn run benchmark quick
$ node execute.js quick

-----------------------------------------------------------------------------

node-html-makrdown (reused instance): 16.0300 ms/file ± 9.15999 (5.8 MB/s)
node-html-markdown                  : 16.3114 ms/file ± 9.81729 (5.7 MB/s)
turndown (reused instance)          : 34.5944 ms/file ± 18.5378 (2.69 MB/s)
turndown                            : 34.9459 ms/file ± 18.7913 (2.66 MB/s)

-----------------------------------------------------------------------------

Total Files: 25
Avg. file size: 95.21 kB

-----------------------------------------------------------------------------

Estimated processing times (fastest to slowest):

  [node-html-makrdown (reused instance)]
    100 kB:  17ms
    1 MB:    172ms
    50 MB:   8.62sec
    1 GB:    2min, 57sec
    50 GB:   2hr, 27min, 7sec

  [node-html-markdown]
    100 kB:  17ms
    1 MB:    175ms
    50 MB:   8.77sec
    1 GB:    2min, 60sec
    50 GB:   2hr, 29min, 42sec

  [turndown (reused instance)]
    100 kB:  36ms
    1 MB:    372ms
    50 MB:   18.60sec
    1 GB:    6min, 21sec
    50 GB:   5hr, 17min, 30sec

  [turndown]
    100 kB:  37ms
    1 MB:    376ms
    50 MB:   18.79sec
    1 GB:    6min, 25sec
    50 GB:   5hr, 20min, 44sec

-----------------------------------------------------------------------------

Speed comparison - node-html-makrdown (reused instance) is: 

  1.02 times as fast as node-html-markdown
  2.16 times as fast as turndown (reused instance)
  2.18 times as fast as turndown

-----------------------------------------------------------------------------

Done in 3.80s.

While these symbols are generally considered valid in URLs, they are a part of Markdown syntax. This can cause problems in some MD renderers, like in Discord embeds, where URLs containing these symbols sometimes break.
Copy link
Collaborator

@nonara nonara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! It looks pretty good. A couple of things and we can merge:

  1. For speed, let's switch the chained regex replace with a single switch iterator (see review code comment)
  2. Need tests ensuring it's encoding properly

@AndrejsK
Copy link
Contributor Author

Requested changes made.

@AndrejsK AndrejsK requested a review from nonara October 24, 2021 12:39
Copy link
Collaborator

@nonara nonara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Thank your for the contribution!

@nonara nonara merged commit 83d4fff into crosstype:master Oct 24, 2021
@nonara
Copy link
Collaborator

nonara commented Oct 24, 2021

Released in v1.1.3

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants