Parsing large HTML files uses way too much memory #31
I have at least partially confirmed high memory use for large HTML files.

Summary

The word size of the result of parsing HTML can be significantly larger than the source HTML, and the increase grows with node density. A 19MB HTML file is parsed by Meeseeks to an estimated 280MB document (excluding binary data size), while a 500KB HTML file parses to an estimated 3MB (excluding binary data size). The increase is less significant when parsing into a tuple-tree: Html5ever (the Elixir library) parses the same files to tuple-trees of around 112MB and 1.3MB (excluding binary data size) respectively.

Data

For a 500KB (on disk) HTML file parsing to about 11,000 nodes,
This excludes most of the binary data (any binary larger than 64 bytes), but it already gives an example of how a large HTML document uses more memory when parsed. Both implementations that use flat maps to represent the HTML document weigh in at ~6x the original document size, and even the tuple-tree representation weighs in at ~2.5x. So how does this scale up to even larger documents? To find out, I created a file duplicating the largest portion of the original document (a big
The bytes per node have scaled close to linearly, but given the density of nodes, the flat-map representations are ~14x and ~12x the original document size, while the tuple-tree representation is ~5.5x. In both examples, the flat-map representations appear to be about 2.5x the size of the tuple-tree representation.

Memory Pressure

Sampling total memory allocation with the 500KB file:
19MB file:
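This kind of memory pressure can be sampled straight from the VM. A minimal sketch, not the exact measurement method used above: `"large.html"` is a placeholder path, `Meeseeks.parse/1` is assumed to be the parse entry point, and the delta is only a rough estimate since other allocations happen concurrently.

```elixir
# Roughly sample total VM memory retained across a parse (illustrative only).
:erlang.garbage_collect()
before_total = :erlang.memory(:total)

document = Meeseeks.parse(File.read!("large.html"))

:erlang.garbage_collect()
after_total = :erlang.memory(:total)

IO.puts("parsed document retains roughly #{after_total - before_total} bytes")
```

Forcing a garbage collection before each sample reduces (but does not eliminate) noise from short-lived garbage created during parsing.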
Discussion

It is not unexpected that the flat-map representation is bigger than the tuple-tree representation: flat maps both make explicit the relationships between nodes that a tuple-tree holds implicitly (more data), and represent nodes using maps instead of tuples (less memory efficient). It is unfortunate, however, that parsing a 19MB (albeit node-dense) HTML file can yield ~300-800MB of memory usage.

Next Steps

I'm not sure yet. I'm open to suggestions.
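The gap between the two representations can be seen even on toy terms with `:erts_debug.flat_size/1`, which reports a term's size in heap words. The node shapes below are illustrative only, not Meeseeks' actual structs:

```elixir
# A flat map keyed by node id, where each node is a map making the
# parent/child relationships explicit...
flat_map = %{
  1 => %{id: 1, parent: nil, children: [2], tag: "div", attributes: []},
  2 => %{id: 2, parent: 1, children: [], tag: "p", attributes: []}
}

# ...versus a tuple-tree that holds those relationships implicitly.
tuple_tree = {"div", [], [{"p", [], []}]}

wordsize = :erlang.system_info(:wordsize)

IO.puts("flat map:   #{:erts_debug.flat_size(flat_map) * wordsize} bytes")
IO.puts("tuple tree: #{:erts_debug.flat_size(tuple_tree) * wordsize} bytes")
```

On these toy terms the flat map comes out several times larger, consistent with the ~2.5x ratio measured above on real documents.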
Regarding the difference between maps and tuples, one solution is to just wait. OTP 21 will include an optimisation that should make structs (and all other maps whose keys are statically known at compile time) have a size comparable to tuples. Tuples consume

It would be nice to have some API allowing updating multiple keys at once - I think this is something to consult the OTP team about. This should significantly decrease the amount of garbage created in the process.

When it comes to optimising the size of the tree itself, I think the primary way would be through increasing sharing. There are often a lot of empty nodes (such as
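The payoff of sharing repeated nodes can be demonstrated with `:erts_debug.size/1` (which counts shared subterms once) versus `:erts_debug.flat_size/1` (which counts every copy). A minimal sketch; the `empty_node` shape is hypothetical:

```elixir
empty_node = {"br", [], []}

# Reusing the same bound term shares a single heap copy...
shared = [empty_node, empty_node]
# ...while building structurally equal terms separately does not.
unshared = [{"br", [], []}, {"br", [], []}]

IO.inspect(:erts_debug.size(shared), label: "shared size (words)")
IO.inspect(:erts_debug.flat_size(shared), label: "shared flat size (words)")
IO.inspect(:erts_debug.size(unshared), label: "unshared size (words)")
```

Note that this sharing holds within a single process heap but is flattened whenever the term is copied, e.g. when sent in a message, so the savings don't survive crossing process boundaries.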
Two aspects of OTP 21 have combined to make the situation better, though still not ideal.
I also gave Michał's term cache idea a shot, but my first naive solution didn't show any positive results. I'll probably explore it more in the future. All in all, OTP 21 is a pretty big win for Meeseeks memory usage.
I have received reports that parsing large (~18MB) HTML files resulted in completely insane memory use (1GB+).
This needs to be confirmed, and if so, addressed.