Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Allowed Schemas option broken by bad html entities in input #693

Open
kiprobinsonknack opened this issue Feb 10, 2025 · 4 comments
Open
Labels

Comments

@kiprobinsonknack
Copy link

kiprobinsonknack commented Feb 10, 2025

To Reproduce

See also example script at bottom

  1. Use options to only allow a tags, only allowing href attribute, and only allowing https schema
  2. Sanitize a string where the href's protocol attribute contains html entities in decimal or hex format, but lacking the trailing semicolon
    • Example of html entities: &#112 (decimal), &#x70 (hex)
    • Example of full string to sanitize-html: <a href=htt&#112://example.com>ClickMe</a>

Expected behavior

Expected: The href attribute is stripped from the string. (<a>ClickMe</a>)

Actual: The href remains, with the ampersand in the attribute changed to &amp;. This results in a tag like: <a href="htt&amp;#112://example.com">ClickMe</a>

Describe the bug

If the html entities are correctly formatted with trailing semicolon, we get the expected output. But the whole point of sanitization is to handle bad input. 😄

In any case, this is clearly not https schema, so the attribute should be stripped.

Security Consideration

I don't think this constitutes a security vulnerability. When the browser sees <a href="htt&amp;#112://example.com">ClickMe</a>, it treats that like <a href="file:///htt&amp;#112://example.com">ClickMe</a>, which is to say, a link to /htt& with everything after the # treated as the page hash. Maybe there's some vulnerability if you know the user happens to have a file in a specific spot on their local system with a name ending in ampersand?

Details

Version of Node.js: v22.13.1

Server Operating System: MacOS Version 15.3 (24D60) (also reproducible on ubuntu, but i'm not sure the version)

Additional context:

n/a

Screenshots

n/a

Example Script

const sanitizeHtml = require('sanitize-html');

// Only allow https links
const options = {
  allowedTags: [
    'a',
  ],
  allowedAttributes: {
    a: [
      'href',
    ],
  },
  allowedSchemes: [
    'https',
  ],
};


// Example 1
// Input:    <a href=htt&#112://example.com>ClickMe</a>
// Expected: <a>ClickMe</a>
// Actual:   <a href="htt&amp;#112://example.com">ClickMe</a>
// Notes:
//     * The HTML entity `&#112` lacks the trailing semicolon
//     * The browser treats the actual result like "file:///htt&amp;#112://example.com">ClickMe</a>"
console.log(sanitizeHtml('<a href=htt&#112://example.com>ClickMe</a>', options));

// Example 2
// Input:    <a href=htt&#112;://example.com>ClickMe</a>
// Expected: <a>ClickMe</a>
// Actual:   <a>ClickMe</a>
// Notes:
//     * Unlike Example 1, this one includes the trailing semicolon in the html entity
console.log(sanitizeHtml('<a href=htt&#112;://example.com>ClickMe</a>', options));

// Example 3
// Input:    <a href=htt&#x70://example.com>ClickMe</a>
// Expected: <a>ClickMe</a>
// Actual:   <a href="htt&amp;#x70://example.com">ClickMe</a>
// Notes:
//     * This is the same as Example 1, except it uses hex encoding rather than decimal
console.log(sanitizeHtml('<a href=htt&#x70://example.com>ClickMe</a>', options));


// Example 4
// Input:    <a href=j&#97v&#97script:&#97lert(1)>ClickMe</a>
// Expected: <a>ClickMe</a>
// Actual:   <a href="j&amp;#97v&amp;#97script:&amp;#97lert(1)">ClickMe</a>
// Notes:
//     * This is _NOT_ an XSS vulnerability: the browser treats this link like `file:///j&amp;#97v&amp;#97script:&amp;#97lert(1)`
console.log(sanitizeHtml('<a href=j&#97v&#97script:&#97lert(1)>ClickMe</a>', options));
@boutell
Copy link
Member

boutell commented Feb 10, 2025

this looks like a weird bypass of allowedSchemas, even if the end result is not exploitable, so I agree we need to look at it.

@kiprobinsonknack
Copy link
Author

This might not have anything to do with bad entities actually:

console.log(sanitizeHtml('<a href=htt//example.com>ClickMe</a>', options));
// OUTPUT:  <a href="htt//example.com">ClickMe</a>

@kiprobinsonknack
Copy link
Author

kiprobinsonknack commented Feb 11, 2025

Actually, I'm not sure if this is really a bug after all 😅

When I said this:

When the browser sees <a href="htt&amp;#112://example.com">ClickMe</a>, it treats that like <a href="file:///htt&amp;#112://example.com">ClickMe</a>

I was mistaken; the browser treats it like a relative link to a file named htt&, not a file:// link. In other words, if i'm viewing the HTML file at https://example.com/some/path, and it sees <a href="htt&amp;#112://example.com">ClickMe</a>, that is a link to https://example.com/some/path/htt&. The same way that <a href="whatever.html#somehash">ClickMe</a> is treated like a link to https://example.com/some/path/whatever.html.

Similarly, <a href="htt//example.com">ClickMe</a> is treated like a link to https://example.com/some/path/htt//example.com.

I think there might still be something worth fixing regarding broken html entities. te&#115t is displayed by the browser (at least in Chrome) as test, but sanitize-html turns it into te&amp;#115t which is displayed by the browser as te&#115t.

Whether that is desired or not, I could kind of argue both ways...

@boutell
Copy link
Member

boutell commented Feb 11, 2025

Interesting. You seem to be saying the browser completely ignores the : and everything after it.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants