wbr element shouldn't be balanced #488

Open
jcushman opened this issue Nov 27, 2019 · 4 comments

@jcushman

The <wbr> element is balanced by bleach.clean even though it is an empty element.

Using the list of empty tags from MDN:

In [6]: empty_elements = {
   ...:     'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr'
   ...: }

In [7]: html = "".join("<%s>" % s for s in empty_elements)

In [8]: import bleach

In [9]: bleach.clean(html, tags=empty_elements)
Out[9]: '<param><source><hr><base><track><area><wbr></wbr><br><img><keygen></keygen><link><input><meta><embed>'

The output includes <wbr></wbr> when it should be just <wbr>, like the others. keygen has the same problem, but it is deprecated, so I'm not sure it's worth including.
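
A quick way to spot this in any cleaned string is to look for end tags that belong to void elements. The sketch below uses only the standard library and does not call bleach; `balanced_void_elements` is a hypothetical helper name, and the regex is a rough check (it would also match an end tag that appears literally inside an attribute value), not a real HTML parser.

```python
import re

# Void (empty) elements, copied from the list quoted above.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "keygen", "link", "meta", "param", "source", "track", "wbr",
}

# Matches an end tag such as </wbr> and captures the tag name.
END_TAG_RE = re.compile(r"</([a-zA-Z][a-zA-Z0-9]*)\s*>")


def balanced_void_elements(html):
    """Return the sorted names of void elements that have an end tag in html."""
    found = {m.group(1).lower() for m in END_TAG_RE.finditer(html)}
    return sorted(found & VOID_ELEMENTS)


# Output string copied from Out[9] above.
cleaned = (
    "<param><source><hr><base><track><area><wbr></wbr><br><img>"
    "<keygen></keygen><link><input><meta><embed>"
)
print(balanced_void_elements(cleaned))  # ['keygen', 'wbr']
```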

@g-k
Collaborator

g-k commented Dec 4, 2019

Hmm, yeah, I can reproduce this. wbr is listed as a self-closing tag in:

(("area", "br", "embed", "img", "keygen", "wbr"),
self.startTagVoidFormatting),

and should have:

token["selfClosingAcknowledged"] = True

but I get

{'type': 'StartTag', 'name': 'wbr', 'namespace': None, 'data': OrderedDict()}
{'type': 'EndTag', 'name': 'wbr', 'namespace': None}

at https://github.com/mozilla/bleach/blob/master/bleach/sanitizer.py#L271 so I'm thinking one of these things might be going on:

  • html5lib doesn't correctly parse it as a self-closing tag (but I didn't see an upstream issue)
  • tagOpenState or another method in html5lib_shim.py leaves the parser in a bad state, so the tag isn't recognized as self-closing
  • the tags arg doesn't pass the tag as a self-closing tag

but I'll need to find more time to look into it further.

g-k added the clean label Sep 16, 2020
@g-k
Collaborator

g-k commented Sep 16, 2020

OK this is a bug in html5lib (v1.1 at least):

» python
Python 3.8.2 (default, Mar 26 2020, 12:39:19)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bleach._vendor.html5lib as html5lib
>>> html5lib.__version__
'1.1'
>>> html5lib.serialize(html5lib.parseFragment('<area>')) # this is correct
'<area>'
>>> html5lib.serialize(html5lib.parseFragment('<wbr>')) # should be <wbr>
'<wbr></wbr>'
>>> html5lib.serialize(html5lib.parseFragment('<keygen>')) # HTML 5.2 deprecates the tag
'<keygen></keygen>'
>>> html5lib.serialize(html5lib.parseFragment('<menuitem>')) # https://github.com/html5lib/html5lib-python/issues/203 mentions this but https://developer.mozilla.org/en-US/docs/Web/HTML/Element/menuitem shows non-void examples and says HTML 5.2 deprecates it
'<menuitem></menuitem>'

The upstream issue is html5lib/html5lib-python#203.
The upstream PR for wbr is html5lib/html5lib-python#395.

Not sure what html5lib's position on deprecated elements is.
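
To illustrate how a parse-then-serialize round trip can produce <wbr></wbr>, here is a toy serializer over token dicts shaped like the StartTag/EndTag pair shown earlier in this thread. It is not html5lib's code: `serialize_tokens` and the `void_elements` parameter are hypothetical, and the sketch assumes the serializer drops end tags only for names in a void-element set. Under that assumption, a tag missing from the set comes out balanced.

```python
def serialize_tokens(tokens, void_elements):
    """Serialize StartTag/EndTag tokens, dropping end tags of void elements."""
    out = []
    for token in tokens:
        name = token["name"]
        if token["type"] == "StartTag":
            out.append("<%s>" % name)
        elif token["type"] == "EndTag" and name not in void_elements:
            # Only non-void elements get an explicit end tag.
            out.append("</%s>" % name)
    return "".join(out)


# Token pair in the same shape as the one observed in sanitizer.py.
tokens = [
    {"type": "StartTag", "name": "wbr", "namespace": None, "data": {}},
    {"type": "EndTag", "name": "wbr", "namespace": None},
]

print(serialize_tokens(tokens, {"area", "br"}))         # <wbr></wbr>
print(serialize_tokens(tokens, {"area", "br", "wbr"}))  # <wbr>
```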

g-k pushed a commit that referenced this issue Sep 16, 2020
g-k added the html5lib label Jan 25, 2021
@ambv

ambv commented Mar 2, 2023

This is now addressed in html5lib:
html5lib/html5lib-python#395

@willkg
Member

willkg commented Oct 25, 2024

Waiting on an html5lib release with this fix. Then we can update the vendored html5lib and test everything.
