Really bad performance #63

Open
demanuel opened this issue Apr 4, 2021 · 8 comments

Comments

@demanuel

demanuel commented Apr 4, 2021

Hi!

I'm trying to parse a 52 MB XML file and the performance is really bad.

I'm following the instructions and just doing:

use XML;  # provides from-xml-file

my $XML = 'ec_inventory_en.xml';
sub MAIN() {
    # Parse the whole file into an in-memory document tree.
    my $xml = from-xml-file($XML);
}

This code uses more than 5 GB of memory [1], runs on only one core [2], and takes more than 3m30s. By comparison, a Perl version parses the file in around 15 seconds.

[1] - Reported by cat /proc/$PID/smaps | grep -i pss | awk '{Total+=$2} END {print Total/1024" MB"}'
[2] - htop image

@jonathanstowe
Member

Hi,
This module is written in pure Raku and is likely to suffer in comparison to parsers in other languages that may use a C library to do the same thing.

If you are concerned about performance then you may want to consider the LibXML binding https://modules.raku.org/dist/LibXML:cpan:WARRINGD

@demanuel
Author

I understand that, but taking 3m30s and 5 GiB just to load a 50 MB file is not OK.

@jonathanstowe
Member

Sure,
My intuition is that the memory usage is due to a very large number of objects, and the speed is due to the allocation of those objects. I guess someone could profile the code to determine if there are unnecessary objects being created and retained.

Another line of attack would be to profile the performance of the Grammar on its own (that is, without the actions that build the object tree) and see if any improvements can be made there.

There are probably some micro-optimisations that could be made within the code but you probably wouldn't want to start on that until profiling has revealed the places that would be of benefit.
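
As a rough illustration of the grammar-only measurement suggested above, here is a minimal sketch that times the same parse with and without an actions class. It uses a toy grammar, not this module's own: Tiny, Build and the sample document are hypothetical stand-ins, and the numbers it prints say nothing about the XML module itself.

```raku
# Toy stand-ins, NOT the XML module's real grammar or actions.
grammar Tiny {
    token TOP     { <element>+ }
    token element { '<' <name> '>' <body> '</' <name> '>' }
    token name    { \w+ }
    token body    { <-[<]>* }
}

# Actions that build one Pair per element, to mimic object-tree construction.
class Build {
    method TOP($/)     { make $<element>.map(*.made).List }
    method element($/) { make (~$<name>[0] => ~$<body>) }
}

# Hypothetical sample input: many small elements in a row.
my $doc = '<a>hello</a>' x 10_000;

my $t0 = now;
Tiny.parse($doc);                     # grammar only, no objects built
my $t1 = now;
Tiny.parse($doc, :actions(Build));    # grammar plus actions
my $t2 = now;

say "grammar only: { $t1 - $t0 }s";
say "with actions: { $t2 - $t1 }s";
```

If the two timings are close, the grammar itself is the bottleneck; if the second is much larger, the object construction in the actions is the better place to look.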

@2colours
Contributor

It's good to see it's not just me. I wanted to use XML::Entity::HTML, which depends on this module. It turns out that out of 160 seconds spent rendering a 1 MB HTML file on 4 cores, 140 went straight to escaping tags, even though that HTML barely had any tags to escape in the first place!

@alabamenhu

This happens because of the way this module was structured. On the one hand, it's a great example of some very cool Raku features but… they're also ones that haven't been very well optimized. There are lots of but mixins, Proxy containers, runtime lookups like ::('Foo'), etc., which will slow things down substantially.

This won't help with memory, but it should help with speed: most classes use method-call syntax for attributes ($.foo) when they could reference the attribute directly ($!foo). I don't think the optimizer is smart enough to optimize that away.
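
To make the $.foo versus $!foo distinction concrete, here is a minimal sketch with a hypothetical Node class (not taken from this module). Both methods return the same string; the difference is only in how the attribute is reached.

```raku
# Hypothetical class for illustration only.
class Node {
    has $.name;

    # $.name is a method call: it dispatches to the public accessor.
    method via-accessor() { '<' ~ $.name ~ '>' }

    # $!name reads the attribute storage directly, with no dispatch.
    method direct() { '<' ~ $!name ~ '>' }
}

my $node = Node.new(name => 'item');
say $node.via-accessor;
say $node.direct;
```

One caveat on swapping them blindly: because $.name is a virtual method call, a subclass that overrides the accessor changes what via-accessor sees, while direct always reads the stored value. That is one way such a change could affect backwards compatibility.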

This can almost certainly be optimized by a LOT, but I'm not sure if it can be done while maintaining 100% backwards compatibility. Guess I'll give it a try.

@supernovus
Collaborator

It's certainly mostly my fault the XML module is slow as molasses in January. When I first worked on this in 2010, I wanted to try using all of the cool language features that had drawn me to Raku in the first place, and had more emphasis on that than on performance. Subsequent updates only focused on trying even more cool new features.

I'd planned on eventually writing an add-on extension using LibXML2 bindings (I see there is at least one module doing that now) but using the simpler API this module provides.

All of the amazing developers who have worked on this since I abandoned my Raku libraries a decade ago have improved it substantially, and they are all saints for working on the convoluted codebase I left behind.

@2colours
Contributor

@supernovus don't worry, it's really much better that you left a lot of stuff to work with and on than to have silently abandoned it. Also, the code isn't all that bad really. When I started looking into it, what struck me was the "builder pattern" everywhere. I thought that would be an immediate and straightforward place for improvements, but then @alabamenhu started actually making changes and reported that there aren't really easy gains with an eager system, in which case I also wouldn't say it's really your fault.

@alabamenhu are you planning to adopt this module, by the way? Not pressuring you in either direction, just curious. And if you don't, maybe your changes could be merged back into this repo, with a new version published perhaps.

@alabamenhu

> @alabamenhu are you planning to adopt this module, by the way? Not pressuring you in either direction, just curious. And if you don't, maybe your changes could be merged back into this repo, with a new version published perhaps.

Not sure if I'll adopt it per se, but I'll see what I can do to work with it. One important thing to consider here: this is a pure Raku module, and that has major value even if a wrapper for LibXML would be faster. There's no guarantee that LibXML will be available on any given system, so a fully vanilla Raku module is a good thing.

One thing that MIGHT be faster is what I did for parsing number format strings in Intl::Format::Number, which is to integrate the actions into the grammar. XML is unambiguous as it moves forward (so it can be made entirely out of tokens), which provides some real opportunities for improvements. It will probably be a few months before I can work out something along those lines though.
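
One possible reading of "integrating the actions into the grammar" is to call make from code blocks inside the tokens themselves, so no separate actions class is dispatched to. The sketch below is an assumption about the general shape, not the approach actually used in Intl::Format::Number; the KeyValue grammar and its input are hypothetical.

```raku
# Hypothetical grammar: each token builds its own result inline,
# instead of relying on a separate actions class.
grammar KeyValue {
    token TOP   { <pair>+ % ';' { make $<pair>.map(*.made).Hash } }
    token pair  { <key> '=' <value> { make (~$<key> => ~$<value>) } }
    token key   { \w+ }
    token value { \w+ }
}

my $match = KeyValue.parse('a=1;b=2');
say $match.made;
```

Because every rule is a token (no backtracking), each inline block runs once its token has matched at that position, which is what makes building results during the parse a reasonable design for a forward-only format.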


5 participants