Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Unopened HTML tag causes exception in budoux 0.6 #355

Closed
johncarter-phntm opened this issue Nov 7, 2023 · 4 comments · Fixed by #360
Closed

Unopened HTML tag causes exception in budoux 0.6 #355

johncarter-phntm opened this issue Nov 7, 2023 · 4 comments · Fixed by #360
Assignees

Comments

@johncarter-phntm
Copy link

With budoux 0.6.0

budoux --html "foo</p>"

Traceback (most recent call last):
  File "/home/johnc/.local/bin/budoux", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/main.py", line 187, in main
    print(_main(test))
          ^^^^^^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/main.py", line 171, in _main
    res = parser.translate_html_string(inputs_html)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/parser.py", line 102, in translate_html_string
    return resolve(chunks, html)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/html_processor.py", line 124, in resolve
    resolver.feed(html)
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/html/parser.py", line 172, in goahead
    k = self.parse_endtag(i)
        ^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/html/parser.py", line 413, in parse_endtag
    self.handle_endtag(elem)
  File "/home/johnc/.local/pipx/venvs/budoux/lib/python3.11/site-packages/budoux/html_processor.py", line 84, in handle_endtag
    self.to_skip = self.element_stack.get_nowait()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/queue.py", line 199, in get_nowait
    return self.get(block=False)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/johnc/.pyenv/versions/3.11.3/lib/python3.11/queue.py", line 168, in get
    raise Empty
_queue.Empty
@tushuhei
Copy link
Member

tushuhei commented Nov 8, 2023

Thanks for reporting this. The change is due to the non-breaking markup support we introduced at #251, where we need to track the elements in a queue.

Could you elaborate your specific use case that you want to include a close tag with no corresponding open tag? I acknowledge that we need to improve the error message, but BudouX is intended to work with a valid document fragment.

@johncarter-phntm
Copy link
Author

Thanks. It was actually a content bug to have a missing open tag, but I thought worth reporting since it's a change in behaviour.

I don't need this to be a supported case, though maybe a nicer exception would be useful - otherwise happy for you to close as expected behaviour.

@tushuhei
Copy link
Member

tushuhei commented Nov 8, 2023

Thanks for your input. Raising a better exepction sounds like a plan.
That said, the element queue's emptiness is not enough to detect if the given HTML string is valid or not. The string like below will work with budoux==0.6.0 CLI even thought the string is not a valid HTML.

budoux --html "あの空を見る限り<img>今日はとてもいい天気ですね。</p>"

We may need to check void elements as well to manage the element queue and the document's validity better.
@kojiishi any thoughts on this?

@kojiishi
Copy link
Collaborator

kojiishi commented Nov 8, 2023

Thanks for reporting this issue!

I think the parser should gracefully handle unpaired close tags. Let me look into this, similar to how browsers handle such case.

For self-closing tags such as <img>, I think it's supported, but we don't have tests. I'll look into it too.

@kojiishi kojiishi self-assigned this Nov 8, 2023
kojiishi added a commit to kojiishi/budoux that referenced this issue Nov 10, 2023
google#251 assumed that all tags are closed properly.

This assumption doesn't stand for cases like:
1. Self-closing tags such as `<img>` don't have corresponding close tags.
2. Unpaired close tags are still valid HTML.

This patch supports these cases by assuming all open tags that doesn't
nest correctly or that doesn't close are automatically closed.

This isn't the full HTML "adoption agency algorithm", but it should be
good enough for the needs of BudouX.

Fixes google#355
kojiishi added a commit that referenced this issue Nov 11, 2023
* Fix unpaired close tags and self-closing tags

#251 assumed that all tags are closed properly.

This assumption doesn't stand for cases like:
1. Self-closing tags such as `<img>` don't have corresponding close tags.
2. Unpaired close tags are still valid HTML.

This patch supports these cases by assuming all open tags that doesn't
nest correctly or that doesn't close are automatically closed.

This isn't the full HTML "adoption agency algorithm", but it should be
good enough for the needs of BudouX.

Fixes #355
kojiishi added a commit to kojiishi/budoux that referenced this issue Nov 11, 2023
This patch changes Java `HTMLProcessor` not to emit close tags if the tag is self-closing.

Also adds tests for:
* Unpaired close tags.
* Self-closing tags don't affect skip nodes (e.g., `<nobr>`.)
These test cases are from google#355.

Fixes google#361.
kojiishi added a commit that referenced this issue Nov 13, 2023
* Java: Stop emitting close tags if self-closing

This patch changes Java `HTMLProcessor` not to emit close tags if the tag is self-closing.

Also adds tests for:
* Unpaired close tags.
* Self-closing tags don't affect skip nodes (e.g., `<nobr>`.)
These test cases are from #355.

Fixes #361.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants