Unopened HTML tag causes exception in budoux 0.6 #355
Comments
Thanks for reporting this. The change is due to the non-breaking markup support we introduced in #251, which requires tracking the elements in a queue. Could you elaborate on the specific use case in which you want to include a close tag with no corresponding open tag? I acknowledge that we need to improve the error message, but BudouX is intended to work with a valid document fragment.
Thanks. It was actually a content bug to have a missing open tag, but I thought it worth reporting since it's a change in behaviour. I don't need this to be a supported case, though a nicer exception might be useful. Otherwise, I'm happy for you to close this as expected behaviour.
Thanks for your input. Raising a better exception sounds like a plan.
We may also need to check void elements to manage the element queue and the document's validity better.
Thanks for reporting this issue! I think the parser should gracefully handle unpaired close tags. Let me look into handling them the way browsers handle such cases. For self-closing tags such as
Fix unpaired close tags and self-closing tags

#251 assumed that all tags are closed properly. This assumption doesn't hold for cases like:

1. Self-closing tags such as `<img>` don't have corresponding close tags.
2. Unpaired close tags are still valid HTML.

This patch supports these cases by assuming that all open tags that don't nest correctly or that are never closed are automatically closed. This isn't the full HTML "adoption agency algorithm", but it should be good enough for the needs of BudouX.

Fixes #355
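
The recovery strategy described in this commit message can be sketched roughly as follows. This is not BudouX's actual code: `OpenElementTracker`, `track`, and the `events` log are hypothetical names made up for illustration, and the sketch assumes Python's standard `html.parser` as the tokenizer. It ignores unpaired close tags, never pushes void elements such as `<img>` onto the stack, and auto-closes anything left open.

```python
from html.parser import HTMLParser

# Void elements per the HTML standard; they never take a close tag.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})


class OpenElementTracker(HTMLParser):
    """Hypothetical helper: tracks open elements and tolerates bad nesting."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of currently open elements
        self.events = []  # (kind, tag) log, kept only for illustration

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return  # e.g. the implicit end of <br/>; nothing to close
        if tag not in self.stack:
            # Unpaired close tag: ignore it instead of raising.
            self.events.append(("ignored", tag))
            return
        # Auto-close everything opened after the matching open tag.
        while self.stack:
            closed = self.stack.pop()
            self.events.append(("close", closed))
            if closed == tag:
                break


def track(html):
    """Feed `html` and return the event log, closing leftovers at the end."""
    tracker = OpenElementTracker()
    tracker.feed(html)
    tracker.close()
    while tracker.stack:
        tracker.events.append(("close", tracker.stack.pop()))
    return tracker.events
```

With this sketch, `track("a</span>b")` records the stray `</span>` as ignored rather than raising, and `track("<b><i>x</b>y")` closes `<i>` before `<b>`. As in the commit message, this is a simplification and not the adoption agency algorithm.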
This patch changes the Java `HTMLProcessor` not to emit close tags if the tag is self-closing.

Also adds tests for:

* Unpaired close tags.
* Self-closing tags not affecting skip nodes (e.g., `<nobr>`).

These test cases are from google#355. Fixes google#361.
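
To illustrate the two behaviours this patch tests, here is a minimal Python sketch. It is not the Java `HTMLProcessor`: `SkipAwareSerializer`, `reserialize`, and the one-element `SKIP_ELEMENTS` set are assumptions made for this example. It re-serializes markup without emitting close tags for void elements, and records whether each text run sits inside a skip node.

```python
from html.parser import HTMLParser

VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
})
# Assumption for illustration: only <nobr> is treated as a skip node here.
SKIP_ELEMENTS = frozenset({"nobr"})


class SkipAwareSerializer(HTMLParser):
    """Hypothetical serializer that never closes void elements."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open non-void elements
        self.output = []     # re-serialized markup pieces
        self.text_runs = []  # (text, inside_skip_node) pairs

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as written in the source.
        self.output.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Drop close tags for void elements and unpaired close tags.
        if tag in VOID_ELEMENTS or tag not in self.stack:
            return
        while self.stack:
            closed = self.stack.pop()
            self.output.append(f"</{closed}>")
            if closed == tag:
                break

    def handle_data(self, data):
        skipping = any(t in SKIP_ELEMENTS for t in self.stack)
        self.output.append(data)
        self.text_runs.append((data, skipping))


def reserialize(html):
    """Return (markup, text_runs), closing any elements still open."""
    serializer = SkipAwareSerializer()
    serializer.feed(html)
    serializer.close()
    while serializer.stack:
        serializer.output.append(f"</{serializer.stack.pop()}>")
    return "".join(serializer.output), serializer.text_runs
```

For `"<nobr>a<br>b</nobr>c"`, the `<br>` does not end the skip state: both `a` and `b` are reported as inside `<nobr>`, `c` is not, and no `</br>` appears in the output.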
With budoux 0.6.0