Really bad performance #63

demanuel · 2021-04-04T11:27:12Z

Hi!

I'm trying to parse a 52 MByte XML file and the performance is really bad.

I'm trying to follow the instructions and just doing:

use XML;  # provides from-xml-file

my $XML = 'ec_inventory_en.xml';
sub MAIN(){
    # Parse the whole file into an in-memory document tree.
    my $xml = from-xml-file($XML);
}

This code uses more than 5 GB of memory [1], uses only one core [2], and takes more than 3m30s. In comparison, a Perl version takes around 15 seconds to parse the file.

[1] - Reported by cat /proc/$PID/smaps | grep -i pss | awk '{Total+=$2} END {print Total/1024" MB"}'
[2] -

jonathanstowe · 2021-05-29T07:37:29Z

Hi,
This module is written in pure Raku and is likely to suffer in comparison to parsers in other languages that may use a C library to do the same thing.

If you are concerned about performance then you may want to consider the LibXML binding https://modules.raku.org/dist/LibXML:cpan:WARRINGD

demanuel · 2021-05-29T14:36:26Z

I understand that, but taking 3m30s and 5 GiB just to load a 50 MB file is not OK.

jonathanstowe · 2021-05-29T16:20:07Z

Sure,
My intuition is that the memory usage is due to a very large number of objects, and the speed is due to the allocation of those objects. I guess someone could profile the code to determine if there are unnecessary objects being created and retained.

Another line of attack would be to profile the performance of the Grammar on its own (that is, without the action that builds the object tree) and see if any improvements can be made there.
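
A minimal sketch of that kind of measurement: time the same parse with and without an actions object. The grammar, the actions class, and the input below are hypothetical stand-ins, not the XML module's own code, so the numbers only show how to separate the two costs.

```raku
# Toy grammar and actions: hypothetical, NOT the XML module's grammar.
grammar Pairs {
    token TOP   { <pair>+ % ',' }
    token pair  { <key> '=' <value> }
    token key   { \w+ }
    token value { \d+ }
}

class PairsActions {
    # Build one Pair per match, then collect them into a Hash.
    method pair($/) { make Pair.new(~$<key>, +$<value>) }
    method TOP($/)  { make $<pair>.map(*.made).hash }
}

# Synthetic input: "k0=0,k1=1,..."
my $input = (^20_000).map({ "k{$_}={$_}" }).join(',');

my $t0 = now;
Pairs.parse($input) or die 'parse failed';
say "grammar only:      { now - $t0 }s";

$t0 = now;
Pairs.parse($input, actions => PairsActions.new) or die 'parse failed';
say "grammar + actions: { now - $t0 }s";
```

If the two timings are close, the grammar itself dominates; if the second is much larger, object construction is the better place to look.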

There are probably some micro-optimisations that could be made within the code but you probably wouldn't want to start on that until profiling has revealed the places that would be of benefit.

2colours · 2022-07-22T18:14:25Z

It's good to see it's not just me. I wanted to use XML::Entity::HTML, which depends on this module. It turns out that out of 160 seconds spent rendering a 1 MB HTML file on 4 cores, 140 went straight to escaping tags, when that HTML barely had any tags to escape in the first place!

alabamenhu · 2023-02-05T01:06:20Z

This happens because of the way this module was structured. On the one hand, it's a great example of some very cool Raku features but… they're also ones that haven't been very well optimized. Lots of but, Proxy, runtime references like ::('Foo') etc, that will slow stuff down substantially.

Won't help with memory, but should help with speed: most classes will use method-call syntax for attributes ($.foo) when they really can reference the attribute directly ($!foo). I don't think the optimizer is smart enough to optimize that away.
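
A minimal sketch of the difference, using a hypothetical Node class that is not taken from the module. Whether the gap is measurable depends on the Rakudo version and its optimizer, so treat the timings as something to check rather than a result.

```raku
# Hypothetical class for illustration only.
class Node {
    has Str $.name;

    # $.name is a method call on self (the public accessor).
    method via-accessor(--> Str) { $.name }

    # $!name reads the attribute directly, with no method dispatch.
    method via-attribute(--> Str) { $!name }
}

my $node = Node.new(name => 'item');

my $start = now;
$node.via-accessor for ^1_000_000;
say "accessor:  { now - $start }s";

$start = now;
$node.via-attribute for ^1_000_000;
say "attribute: { now - $start }s";
```

One caveat: $.name dispatches virtually, so a subclass can override it, while $!name cannot be overridden. Switching between them is therefore not always behavior-preserving.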

This can almost certainly be optimized by a LOT, but I'm not sure if it can be done while maintaining 100% backwards compatibility. Guess I'll give it a try.

supernovus · 2023-02-21T16:09:38Z

It's certainly mostly my fault the XML module is slow as molasses in January. When I first worked on this in 2010, I wanted to try using all of the cool language features that had drawn me to Raku in the first place, and had more emphasis on that than on performance. Subsequent updates only focused on trying even more cool new features.

I'd planned on eventually writing an add-on extension using LibXML2 bindings (I see there is at least one module doing that now) but using the simpler API this module provides.

All of the amazing developers who have worked on this since I abandoned my Raku libraries a decade ago have improved it substantially, and they are all saints for working on the convoluted codebase I left behind.

2colours · 2023-02-21T17:04:16Z

@supernovus don't worry - it's really much better that you left a lot of stuff to work on than to have silently abandoned it. Also, the code isn't all that bad really... when I started looking into it, what struck me was the "builder pattern" everywhere. I thought that would be an immediate and straightforward place for improvements - but then @alabamenhu started actually making changes and reported that there aren't really easy gains with an eager system - in which case I also wouldn't say it's really your fault.

@alabamenhu are you planning to adopt this module, by the way? Not pressuring you in either direction, just curious. And if you don't, maybe your changes could be merged back into this repo, with a new version published perhaps.

alabamenhu · 2023-02-22T12:53:39Z

> @alabamenhu are you planning to adopt this module, by the way? Not pressuring you in either direction, just curious. And if you don't, maybe your changes could be merged back into this repo, with a new version published perhaps.

Not sure if I'll adopt it per se, but I'll see what I can do to work with it. One important thing to consider here: this is a pure Raku module, and that has major value even if a wrapper for LibXML would be faster. There's no guarantee that LibXML will be available on any given system, so a fully vanilla Raku module is a good thing.

One thing that MIGHT be faster is what I did for parsing number format strings in Intl::Format::Number, which is to integrate the actions into the grammar. XML is unambiguous as it moves forward (so it can be made entirely out of tokens), which provides some real opportunities for improvement. It will probably be a few months before I can work out something along those lines though.
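
A minimal sketch of what integrating actions into a grammar can look like: code blocks embedded in tokens call make directly, so no separate actions class is dispatched to. The grammar is a hypothetical toy, not code from Intl::Format::Number or from this module.

```raku
# Toy grammar with inline actions: hypothetical, for illustration only.
grammar InlinePairs {
    # Tokens do not backtrack; each embedded block runs once its
    # surrounding pattern has matched up to that point.
    token TOP  { <pair>+ % ',' { make $<pair>.map(*.made).hash } }
    token pair { (\w+) '=' (\d+) { make Pair.new(~$0, +$1) } }
}

my %result = InlinePairs.parse('a=1,b=2').made;
say %result<a>;
say %result<b>;
```

Whether this is faster than a separate actions class for a grammar the size of XML's would still need to be measured.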

timo · 2025-02-23T00:23:38Z

While looking into #68 I found that the <xmldecl>? and <doctypedecl>? allow catastrophic backtracking behaviour when the document in total is not well-formed. I'm not sure if changing that makes the performance better for well-formed documents, but tracing what the grammar does exactly might be a good first step to improving the performance.

timo · 2025-02-26T19:15:19Z

Worked on this over in #73 a little bit.

2colours mentioned this issue Jul 22, 2022

Horroristically slow raku-community-modules/XML-Entity-HTML#2

Closed

coke mentioned this issue Jan 6, 2024

XML hangs in Web::Scraper #68

Open
