Use of escape.Markdown for #text elements #7

Open
chamilad opened this issue Oct 16, 2019 · 3 comments
Labels
docs improve documentation

Comments

@chamilad

Hello,

I'm using your library for a markdown generation tool for static site generators. The Rule interface is just perfect!

As I read through the code, the use of escape for #text elements looks like a problem to me. Could you explain why it was added? I couldn't work out why certain characters need to be escaped.

Thanks!

@JohannesKaufmann
Owner

@chamilad great that you like the library!

If the snippet <p>**Not Strong**</p> were run through the library without escaping, it would produce **Not Strong**, which a Markdown renderer displays as bold text. That is not what we expect. These side effects happen with quite a few characters ("*" for bold, "_" for italic, "-" for list items, four leading spaces accidentally creating a code block, ...).
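To make the problem concrete, here is a minimal sketch of the idea behind escaping. The escapeMarkdown helper is hypothetical and covers only three characters; it is not the library's actual implementation, which handles more cases.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeMarkdown is a hypothetical helper for illustration only.
// It backslash-escapes a few characters that Markdown treats as syntax,
// so that literal text from the HTML stays literal in the output.
func escapeMarkdown(text string) string {
	// strings.NewReplacer scans the input once, so the backslashes
	// it inserts are never escaped a second time.
	replacer := strings.NewReplacer(
		`\`, `\\`,
		`*`, `\*`,
		`_`, `\_`,
	)
	return replacer.Replace(text)
}

func main() {
	text := "**Not Strong**" // text content of <p>**Not Strong**</p>

	fmt.Println(text)                 // unescaped: a renderer would show bold text
	fmt.Println(escapeMarkdown(text)) // escaped: a renderer shows literal asterisks
}
```

Note that a blanket rule like this also doubles every backslash in the text, which is the price of treating all text content the same way.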


When a header (eg. <h3>) contains any new lines in its body, it will split the header contents
over multiple lines, breaking the header in Markdown (because in Markdown, a header just
starts with #'s and anything on the next line is not part of the header). Since in HTML
and Markdown all white space is treated the same, I chose to replace line endings with spaces.
-> lunny/html2md#6
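
The whitespace handling described in that quote can be sketched roughly as follows. The headerLine helper is a hypothetical name used for illustration; it is not taken from either library.

```go
package main

import (
	"fmt"
	"strings"
)

// headerLine is a hypothetical helper. It collapses every run of
// whitespace (including line endings) in the header content into a
// single space, then prefixes the result with the "#" markers.
func headerLine(level int, content string) string {
	text := strings.Join(strings.Fields(content), " ")
	return strings.Repeat("#", level) + " " + text
}

func main() {
	// Content of an <h3> whose body contains line breaks.
	fmt.Println(headerLine(3, "Installing\n  the\nlibrary"))
}
```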

With escaping, this input will generate this output which is not perfect but close to the original.


@chamilad if you send me some snippets that behave unexpectedly, I'm happy to add some test cases and fix that.

As background information: this library was designed to have whole websites piped through it, so it is supposed to handle some weird edge cases.

@estyrke

estyrke commented Feb 2, 2021

Hi there! First, thanks for a great library! Second, I have an example that behaves unexpectedly:

The document I'm converting contains maths equations such as <span class="tex2jax_process">$L’ = (1+n \cdot C) \cdot L$</span>. Amazingly, this almost works out of the box since the $$ syntax is apparently used in some Markdown flavors as well. However, I get $L’ = (1+n \\cdot C) \\cdot L$, i.e. the backslashes before cdot are escaped. I would need them "raw": $L’ = (1+n \cdot C) \cdot L$.

If this is a corner case that breaks something else, then I'm happy to just write my own rule to override the default one, just thought I'd mention this.

@JohannesKaufmann
Owner

@estyrke Yeah, you are right, that is a bug. Unfortunately, it's not that easy to fix.

I have thought about a new approach that might make escaping more reliable (also resolving #19), but that requires a substantial refactor. And I don't have time for that at the moment 🤷‍♂️


For now, you can create a custom rule for "span" and register it using AddRules.

Then check whether the element has the classname “tex2jax_process” using selec.HasClass.

If it does, return selec.Text() instead of content. That gives you the original text, which is not escaped.

If it does not have the classname, return nil, which makes the converter fall back to the default rule.

Let me know if you have any problems...

@JohannesKaufmann JohannesKaufmann added the docs improve documentation label May 17, 2022