Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Review of loading time #77

Closed
tlienart opened this issue Nov 1, 2019 · 21 comments
Closed

Review of loading time #77

tlienart opened this issue Nov 1, 2019 · 21 comments

Comments

@tlienart
Copy link
Collaborator

tlienart commented Nov 1, 2019

@ablaom skip to the last message for a summary

In line with JuliaAI/MLJModels.jl#118

Initial considerations:

  • slowest package to load is ScientificTypes by a huge margin (7 seconds; all other packages take like 0.5)
  • the series of export is slower than an unrolled one

I'll keep investigating

I'm a bit confused; loading ST independently takes 0.1s; loading it with CategoricalArrays is significantly slower but still not 7s.

@tlienart
Copy link
Collaborator Author

tlienart commented Nov 1, 2019

this gives an idea:

"1-export / base using" = "1-export / base using"
time() - start = 1.0568718910217285
"2 statsbase,missings" = "2 statsbase,missings"
time() - start = 0.38069915771484375
"3 distrib" = "3 distrib"
time() - start = 1.9556851387023926
"4 st" = "4 st"
time() - start = 7.132582902908325
"5 stats,random,intutils" = "5 stats,random,intutils"
time() - start = 0.0036628246307373047
time() - start = 0.2949249744415283

step 4 is just using ScientificTypes

However taken independently, there doesn't seem to be a masssive slowdown...

@tlienart
Copy link
Collaborator Author

tlienart commented Nov 1, 2019

In a fresh session,

julia> @time using Tables, OrderedCollections, CategoricalArrays
  9.802188 seconds (19.62 M allocations: 1.367 GiB, 5.16% gc time)

julia> @time using StatsBase, Missings 
  1.672245 seconds (4.69 M allocations: 257.037 MiB, 3.42% gc time)

julia> @time using Distributions
  0.693945 seconds (735.33 k allocations: 40.466 MiB)

julia> @time using ScientificTypes
  1.124992 seconds (2.93 M allocations: 144.996 MiB, 6.22% gc time)

benchmarking stuff in a module is a bit weird I guess...

@tlienart
Copy link
Collaborator Author

tlienart commented Nov 1, 2019

seems to point out to categorical arrays being shitty

julia> @time using Tables
  0.099370 seconds (142.67 k allocations: 7.347 MiB)

julia> @time using OrderedCollections
  0.181730 seconds (470.81 k allocations: 22.469 MiB, 6.51% gc time)

julia> @time using CategoricalArrays
  9.064157 seconds (18.09 M allocations: 1.295 GiB, 4.98% gc time)

@tlienart
Copy link
Collaborator Author

tlienart commented Nov 1, 2019

There seems to be a massive discrepancy between a terminal REPL and Juno... this is in a fresh repl session (Same julia version); no problem here...

julia> @time using Tables
  0.484155 seconds (961.58 k allocations: 45.975 MiB, 1.36% gc time)

julia> @time using OrderedCollections
  0.343333 seconds (470.39 k allocations: 22.449 MiB, 2.25% gc time)

julia> @time using CategoricalArrays
  0.632599 seconds (1.15 M allocations: 70.556 MiB, 1.56% gc time)

Not sure how to analyse this.

In both the REPL and Juno, loading MLJBase takes 13-15 seconds which is way too much.

When running timings in the module, it's always the using ScientificTypes which takes too long.

@tlienart
Copy link
Collaborator Author

tlienart commented Nov 1, 2019

Right so the culprit is to be found in Requires, making everything a hard dependency in scientific types hugely decreases latency:

"1-export / base using -" = "1-export / base using -"
time() - start = 2.057216167449951
"2 statsbase,missings" = "2 statsbase,missings"
time() - start = 0.4928550720214844
"3 distrib" = "3 distrib"
time() - start = 0.8699688911437988
"4 st" = "4 st"
time() - start = 1.2298798561096191
"5 stats,random,intutils" = "5 stats,random,intutils"
time() - start = 0.008093118667602539
time() - start = 0.4565088748931885

1.2s instead of 7s

I guess we could make that even less by removing AbstractTrees which is useless anyway.

Generally there's probably choices to be made in terms of where we have optional dependencies, while it makes total sense in MLJModels, I'm less convinced it makes sense in MLJBase or MLJ and I definitely don't think it makes sense in ST.

If optional deps were overhead free, pushing for optional deps could be a good idea but it seems it's far from the cases so maybe let's add more rather than fewer hard deps and eventually when things stabilise and it's clearer what we can do, let's make stuff optional?

@tlienart tlienart changed the title Review of loading time Review of loading time -- conclusion: requires should be avoided Nov 2, 2019
@tlienart
Copy link
Collaborator Author

tlienart commented Nov 2, 2019

Ok so after a lot of head scratching and trial and error here's what I did:

  • Remove stuff which didn't truly support functionality in ST (AbstractTrees, ColorTypes, InteractiveUtils), made everything a hard dep, removed Color stuff from ST which we're not using atm anyway; removed Requires
  • In MLJBase, commented out the init part

The loading time drops from 10-15s to... 2s

So I definitely think there's some thought to be had on this; if we don't want MLJ to be shunned because it's mega slow to start we should start being smart about possibly having hard deps for things that will most likely be used anyway.

Suggestions

  • Make ST have hard deps only, remove the stuff we're not using at the moment
  • In MLJBase, make everything a hard dep too; this may mean removing CSV dependency because @time using CSV takes ~5 seconds.
  • Adjust code base in ST and MLJBase to reflect these hard-deps only

I think there's a larger discussion to be had but basically using Requires a lot should be avoided as much as possible; in MLJModels it's inevitable but elsewhere we probably should not use it; btw performance of Requires this has been a problem since 2017 apparently: JuliaPackaging/Requires.jl#39

There will probably be things that we can't do by dropping optional deps etc but I think it might constrain us in the right direction, having code that is more generic and faster to load. I think we should make a strong move in that direction ASAP, loading times of 10-30s will make people run away from the package...

Some numbers

Too slow

  • Loading CSV takes ~5s

OK

  • Loading Distributions takes 0.8 seconds
  • Loading Tables takes 0.5 seconds
  • Loading LossFunctions takes 1.5s
  • Loading StatsBase takes 0.3s
  • Loading CategoricalArrays takes 0.9s

Instant

  • Loading stdlib like Statistics, Random, InteractiveUtils
  • Distances, OrderedCollections, Parameters

So from looking at this, it would seem that the only real reason to use Requires would be for CSV; I don't really know why that would be a necessity to have it in MLJBase; other than that, using optional deps does not seem to serve a purpose.

Further than that, there may be things to shave in the way things are loaded in MLJModels (beyond requires), like the way the metadata is read etc, but I think if we sort out this first, we will already see a huge improvement, and then we can tackle more minor details.

@tlienart
Copy link
Collaborator Author

tlienart commented Nov 2, 2019

See also https://discourse.julialang.org/t/loading-time-for-2-packages-sum-of-individual-loading-time/30630

There seems to be an issue with Julia 1.3...

Loading MLJModels on Julia 1.2 takes 3s; on Julia 1.3 it takes 13s ...

@tlienart tlienart changed the title Review of loading time -- conclusion: requires should be avoided Review of loading time Nov 2, 2019
@tlienart
Copy link
Collaborator Author

tlienart commented Nov 2, 2019

See also this issue: JuliaLang/julia#33615

It may be that when this is fixed, the recommendations here can essentially be ignored.

@tlienart
Copy link
Collaborator Author

takes 3s now 👍

@cstjean
Copy link

cstjean commented Aug 11, 2020

On Julia 1.5, even when precompiled,

julia> @time using MLJBase
 12.497485 seconds (22.42 M allocations: 1.155 GiB, 5.92% gc time)

pkg> st MLJBase
Status `~/Programa/Project.toml`
  [a7f614a8] MLJBase v0.14.6

😞 Is there any plan to move some of the bigger dependencies (HTTP, JLSO?) to MLJ? Using MLJBase in my package makes ]test much longer than it otherwise would be, and that's a shame.

@tlienart tlienart reopened this Aug 11, 2020
@tlienart
Copy link
Collaborator Author

I think we should drop HTTP and the openML stuff, that should be a separate package; I don't know what JLSO is.

Then there's this plan of taking the measures outside of MLJBase which would take out LossFunctions too...

@cstjean
Copy link

cstjean commented Aug 11, 2020

JLSO.jl is part of MLBase: https://github.com/alan-turing-institute/MLJBase.jl/blob/master/Project.toml#L16

It has a lot of dependencies

(TestEnv) pkg> add JLSO
   Updating registry at `~/.julia/registries/General`
######################################################################## 100.0%
  Resolving package versions...
Updating `~/Programa/TestEnv/Project.toml`
  [9da8a3cd] + JLSO v2.3.2
Updating `~/Programa/TestEnv/Manifest.toml`
  [fbb218c0] + BSON v0.2.6
  [944b1d66] + CodecZlib v0.7.0
  [e2ba6199] + ExprTools v0.1.1
  [8f5d6c58] + EzXML v1.1.0
  [48062228] + FilePathsBase v0.9.4
  [9da8a3cd] + JLSO v2.3.2
  [682c06a0] + JSON v0.21.0
  [94ce4f54] + Libiconv_jll v1.16.0+5
  [f28f55f0] + Memento v1.1.0
  [78c3b35d] + Mocking v0.7.1
  [69de0a69] + Parsers v1.0.10
  [3cdcf5f2] + RecipesBase v1.0.2
  [cea106d9] + Syslogs v0.3.0
  [f269a46b] + TimeZones v1.3.2
  [3bb67fe8] + TranscodingStreams v0.9.5
  [02c8fc9c] + XML2_jll v2.9.10+1
  [83775a58] + Zlib_jll v1.2.11+15

@tlienart
Copy link
Collaborator Author

@ablaom what do you think? Should this serialisation business be either done differently so that it depends on fewer things or otherwise extracted from MLJBase? I've not looked at that part of the code at all so no idea what's reasonable

@ablaom
Copy link
Member

ablaom commented Aug 16, 2020

I agree it makes sense to remove one or both of model serialisation and OpenML into new packages, if this is impacting on the load times.

Using MLJBase in my package makes ]test much longer than it otherwise would be, and that's a shame.

@cstjean Just curious, what package are you using that depends on MLJBase?

@ablaom
Copy link
Member

ablaom commented Aug 16, 2020

Yes, just confirming JLSO has a huge impact. Times with or without pre-compilation are roughly halved by dumping JLSO.

cc JLSO author @oxinabox may want to comment, but I think we go ahead and move serialisation out.

(Just FYI, JLSO is the fallback serialisation method. Models can implement custom serialisation. )

Thanks @cstjean for pointing this one out!

@ablaom
Copy link
Member

ablaom commented Aug 16, 2020

The alternative is to use julia native serialisation as the fallback. Actually, this is essentially what is happening - by default JLSO uses native serialisation. However, it adds a wrapper that records the julia version, the state of packages, and so forth, which is kind of nice for reproducibility.

@oxinabox
Copy link
Contributor

@rofinn we should look into what is making JLSO have such an impact on load-times.
Its probably hurting out other projects that also use it.
(and it might be traceable to another package we use a lot like Memento or TimeZones.jl which we should really deal with)

@rofinn
Copy link

rofinn commented Aug 18, 2020

Most of the JLSO load time is from Memento's __init__ function. invenia/Memento.jl#163

@rofinn
Copy link

rofinn commented Aug 18, 2020

The PR for a partial fix is open invenia/Memento.jl#164

@ablaom
Copy link
Member

ablaom commented Aug 18, 2020

@rofinn @oxinabox Thanks indeed for looking into this.

@ablaom
Copy link
Member

ablaom commented Jun 6, 2021

Update:

As far as I can tell, the only issues brought up here left unresolved and not tracked elsewhere is the surprisingly long load time for LossFunctions.jl. So I am closing this now in favour of #570

@ablaom ablaom closed this as completed Jun 6, 2021
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants