-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
perf: avoid cheerio in favor of regex #404
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
Hey @GalacticHypernova @Baroshem Here is a description of how it works.
Now at a higher level, these three routines are called at the time the html is rendered, with the Cheerio is used to parse the content of each section and find out the relevant parts of the HTML that need to be modified. This is why I added 02a-preprocessHtml and 99b-recombineHtml : it avoids parsing the same sections each time. So 02a-preprocessHtml starts by loading each of the sections in the cheerio parser, via As you can see 02a and 99b were added afterwards, for performance reasons only. |
@vejja thanks for the explanation! This really does help a lot, and will be very useful for the PR. Does that mean no further modification needs to be made for sections in 02a (before the 3 routines)? No need to trim or anything like that? I was just wondering if it actually needs to be modified at all. |
I think if you can get rid of 02a and 99b, it would be fantastic. Honestly they are a legacy of trying to quick-fix the cheerio issues |
👍 |
If I am correct (I haven't worked much with nitro's HTML hooks myself), the bodyAppend/prepend/etc return strings, right? They just return the stringified HTML? Because I see a |
I can’t remember but I think Typescript gives you the type Edit: it’s actually an array of strings: referenced in https://nuxt.com/docs/api/advanced/hooks#nitro-app-hooks-runtime-server-side |
Thanks! I remembered that vaguely but I also didn't get the chance to check it myself. |
@vejja in 03, can external scripts have |
I think in theory it should always be the first syntax as <script> is not supposed to be self-closing. So my mistake for suggesting the other way round. Update: Let's keep our lives simple and not support Thanks for bringing this up @GalacticHypernova |
Also if you're wondering what regex we used to have before Cheerio, you can check out: |
Thanks!
Yea, I made the patterns a bit more specific so that it would be easier to handle, though we would likely want to test its performance. |
Up to you. You're absolutely right that otherwise they won't have integrity if they use the self-closing syntax |
@@ -2,6 +2,8 @@ import { useStorage, defineNitroPlugin, getRouteRules } from '#imports' | |||
import { isPrerendering } from '../utils' | |||
import { type CheerioAPI } from 'cheerio' | |||
|
|||
const SCRIPT_RE = /<script((?=[^>]+src="[\w:.-\\/]+")(?![^>]+integrity="[\w-]+")[^>]+)\/>/g |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Progress report on the script regex:
const SCRIPT_RE = /<script((?=[^>]+src="[\w:.-\\/]+")(?![^>]+integrity="[\w-]+")[^>]+)(?:\/>|><\/script>)/g
const a =[
`<script src="a.k"/><script src="b"></script>`,
`<div>`,
`<script>c</script>`,
`<script src="a" integrity="a"></script> <script src="b"/>`
]
console.log(a.map(r=>r.replace(SCRIPT_RE,(match, mb)=>{return `<script integrity="a"${mb}></script>`})))
It works for both self-closing and non-self-closing script tags and automatically converts them to non-self-closing.
We could go out of the way to warn users during build to not use self-closing syntax, but I'll leave that up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice !
Can you test also for
- integrity attribute before src attribute
- random attribute (e.g.
data-xxx
) before and after src attribute - no src attribute with self-closing syntax
- many scripts in the same string
As well as edge cases such as :
<script>console.log('I like to mess up with javascript by printing </script>')</script>
<p> I like to mess up with pre tags such as <pre> a code example: <script></script> </pre> </p>
or an even more evil version:<p> I like to mess up with pre tags such as <pre> a code example: <script> that is non-closing by mistake </pre> </p>
<script><!-- I like to mess up with HTML comments such as </script> --></script>
or its variant:<script><!-- I like to mess up with HTML comments such as <script> --></script>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do and post pictures soon!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Response for no src self-closing script:
Tested with this sample (same as in the PR):
const SCRIPT_RE = /<script((?=[^>]+src="([\w:.-\\/]+)")(?![^>]+integrity="[\w-]+")[^>]+)(?:\/>|><\/script>)/g
const sriHashes = {
"a.k":"a",
"b":"c",
"script.js":"newIntegrity"
}
const a = [
`<script src="a.k"/><script src="b"></script>`, //double scripts
`<div>`, //non-script
`<script>c</script>`, //inline script
`<script src="a" integrity="a"></script> <script src="b"/>`, //mixed script (one with integrity, one without)
`<script integrity="integrity" src="script.js"></script>`, //integrity before src
`<script data-xxx src="script.js"></script>`, //random attr before
`<script src="script.js" data-yyy></script>`, //random attr after
`<script data-xxx src="script.js" data-yyy></script>`, //random attr before and after
`<script src="script1.js"></script><script src="script2.js"></script>`, //multiple scripts in the same line
`<script/>`, // self closing with no src
`<script>console.log('I like to mess up with javascript by printing </script>')</script>`, //first edge case - print
`<p> I like to mess up with pre tags such as <pre> a code example: <script></script> </pre> </p>`, // second edge case - <pre>
`<p> I like to mess up with pre tags such as <pre> a code example: <script> that is non-closing by mistake </pre> </p>`, // second edge case - <pre> (evil version)
`<script><!-- I like to mess up with HTML comments such as </script> --></script>`, // third edge case - comment inside script
`<script><!-- I like to mess up with HTML comments such as <script> --></script>` // third edge case - comment inside script (variant)
];
console.log(a.map(r=>r.replace(SCRIPT_RE,(match, rest, src: Partial<keyof typeof sriHashes>)=>{
const hash = sriHashes[src]
if (hash) {
return `<script integrity="${hash}"${rest}></script>`
}
return match
})))
Please let me know if anything got incorrect results!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing, thanks a lot !!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, one more thing. Would it be too much of an issue if "valid" external scripts would get integrity? I just tested with
console.log(<script src="a.k"></script>)
and it returned the the script with the integrity, but I don't really think that would cause any trouble, would it?
I am not too sure to be honest
- On the one hand, in theory this is a problem that could lead to many issues. We are modifying user code by doing XSS ourselves here, which looks really bad for a security module
- On the other hand, I don't even know how browsers are supposed to interpret this...
This might be the reason why the author of https://github.com/Baroshem/nuxt-security/blob/v1.0.0-rc.1/src/runtime/nitro/plugins/99-cspNonce.ts was doing some lookbehind search around quotes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, unfortunately you can't win them all with regex.. But at least the edge cases and potential issues are covered, so that's something
Sorry, I was commenting on your idea of a WeakMap cache
Our comments on integrity injection crossed at the same time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, all good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be the reason why the author of https://github.com/Baroshem/nuxt-security/blob/v1.0.0-rc.1/src/runtime/nitro/plugins/99-cspNonce.ts was doing some lookbehind search around quotes
Yea, but at the same time it could be something like:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not that straight-forward unfortunately, it needs to be very thoroughly planned for it to be executed correctly because there are many things that could go wrong here. But I'm afraid that until that happens, we might want to focus on optimizing cheerio as much as possible.
Also what about links? Are they always self closing? Or should I treat them the same as script tags? |
I think |
We might want to look into a hybrid approach, at least until a suitable solution is found for the edge cases that are harder to catch with regex. That is, for external scripts/links/etc, regex would be used, but for inline stuff cheerio. That way it would still increase performance (as it would be used less), but wouldn't compromise integrity entirely (again, as a temporay solution). |
Isn't an inline script the same as an external script except that it doesn't have an src attribute ? |
Well, not exactly. Assuming the worst case scenario, that everyone using this module is trying to cause harm, there are a lot more ways for inline scripts to cause trouble. |
Hey folks, any progress on this? I would love to see this being added to the upcoming 2.0.0 release :) |
Hey! Sorry for the slow progress, been very busy with real life matters. I'm currently trying to debug the errors, I believe those are the only blockers. If you have any idea why they may be popping up I could push a fix, maybe even today. |
@vejja do you have any idea why the tests are failing? Nothing in this PR seems to be the cause of this. Is the issue in the unit tests? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM now
Only substantial modification was to make 'as' optional in LINK_RE when scanning ssgHashes
Thanks @GalacticHypernova for doing all the heavy-lifting here with regexes
This is a huge improvement that is expected to have a significant impact in edge runtimes
💚
@vejja glad I could help! I just noticed I accidentally made as required when making the pattern, sorry about that and thanks for catching it! It's a bit hard to see that on phone 🥲 |
I have 2 main candidates for the optional as pattern: Any preferences? The only difference is that in the first pattern the positive lookahead adds an empty alternative with In terms of performance, from what I tested they perform similarly, but I guess only time will tell with full-fledged websites. |
Can you give an example where the 2 patterns would give a different result ? |
I can try to find some edge case but from what I have tested thus far its the same, only expressed differently. The first pattern essentially tells the regex engine to make sure there is either the as attribute or an empty alternative (which would always be true since it's a fallback), and the second pattern (the one you suggested) tells the engine to look for a potential match with the as attribute, meaning it will also accept nothing as it's only a potential match. |
Types of changes
Description
This PR integrates native solution via regex to load and parse HTML, in order to improve runtime performance.
Closes #341
Checklist: