
Split off the CPU implementation #224

Closed

maleadt opened this issue Jul 15, 2020 · 13 comments

@maleadt
Contributor

maleadt commented Jul 15, 2020

Currently NNlib.jl provides both the abstract interface and a CPU implementation of its functions, which is becoming a problem now that NNlib.jl depends on LoopVectorization.jl. I think the package may need to be split into an abstract interface and a CPU implementation, e.g., like AbstractFFTs.jl/FFTW.jl. Such a split need not hurt usability, as the FFTW example shows.

Case in point: installing CUDA.jl (which implements the NNlib.jl interfaces for use with CuArrays) requires the following additional dependencies when integrating with NNlib: CpuId, SIMDPirates, DocStringExtensions, OffsetArrays, SLEEFPirates, LoopVectorization, VectorizationBase, NNPACK_jll, UnPack. The JLL is annoying, but okay. The fact that there are so many packages, however, causes the time to precompile CUDA.jl to increase by a whopping 50%, from 20s to 30s, as measured with hyperfine:

hyperfine 'julia -e "Base.compilecache(Base.PkgId(Base.UUID(\"052768ef-5323-5732-b1bb-66c8b64840ba\"), \"CUDA\"))"'

I'm not familiar enough with the ML stack / NNlib.jl to figure out exactly what that would look like, but I do think we can improve things here. I'd rather not @require the NNlib integration in CUDA.jl and lose semver tracking etc.
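For readers unfamiliar with the AbstractFFTs.jl/FFTW.jl pattern, a minimal sketch of such a split is shown below. The module names (NNinterface, NNimplCPU) and the softmax example are purely illustrative stand-ins, not the API proposed in this issue; the point is that the interface package carries only function stubs, and backends attach methods to them by dispatch.

## Hypothetical interface package: declares stubs only, no implementations.
module NNinterface
function softmax end
export softmax
end

## Hypothetical CPU backend: depends only on the interface package.
module NNimplCPU
import ..NNinterface: softmax   # import so methods can be added

# Reference implementation for plain CPU arrays.
function softmax(x::AbstractArray; dims = 1)
    y = exp.(x .- maximum(x; dims = dims))
    return y ./ sum(y; dims = dims)
end
end

using .NNinterface
softmax(rand(3, 4))   # dispatches to the method added by NNimplCPU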

@CarloLucibello
Member

I think we can ditch NNPACK_jll (or move it under a @require) since it is unreliable, and we can move the LoopVectorization stuff into Flux (see the discussion in FluxML/Flux.jl#1272). I'd prefer not to create another repo, to avoid the maintenance burden.

@maleadt
Contributor Author

maleadt commented Jul 16, 2020

I'd prefer not to create another repo to avoid the maintenance burden

With multi-package repos (JuliaLang/Pkg.jl#1251), that should be easier than it was. Not sure how ready to use that is, though.

@mcabbott
Member

Agree that having NNlib stay small seems like a good idea, but I think having simple fallback / reference implementations here (just broadcasting, etc) seems worthwhile.

Having fast CPU implementations of things like softmax live somewhere that isn't Flux would be great. Flux is huge & they are more widely useful. Can they just live here under @require LoopVectorization?
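A rough sketch of what that @require route could look like, using Requires.jl. The module name, file name, and UUID placeholder below are assumptions for illustration, not NNlib's actual layout; as Proposal 4 later in this thread notes, the trade-off is that the optional dependency loses compat/version tracking and users must know to load LoopVectorization themselves.

## Sketch only: conditional fast path via Requires.jl (hypothetical layout).
module NNlibSketch

using Requires

# Generic fallback, always available.
softmax(x; dims = 1) = (y = exp.(x .- maximum(x; dims = dims)); y ./ sum(y; dims = dims))

function __init__()
    # This block only runs if the user also loads LoopVectorization.
    # The UUID is a placeholder; the real one comes from LoopVectorization's Project.toml.
    @require LoopVectorization="00000000-0000-0000-0000-000000000000" begin
        # @avx-accelerated methods for dense CPU arrays would be defined here,
        # e.g. by including a file that overloads softmax(::Array).
        include("loopvectorization_impls.jl")
    end
end

end # module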

@CarloLucibello
Member

CarloLucibello commented Aug 1, 2020

This discussion has been split across many different places, e.g. Discourse and FluxML/Flux.jl#1272; let's consolidate it here.

I'll lay out a first proposal for restructuring the FluxML org.

Proposal 1

  • We create a new repo as suggested in the OP, let's call it NNcore.jl, with just the function definitions and no implementations, for backend vendors to extend.
## NNcore.jl
function linear_bias_act end
  • Flux will provide layers as a tiny interface to NNcore's primitives
## Flux.jl
using NNcore
(m::Dense)(x) = linear_bias_act(m.W, x, m.b, m.act)
  • Generic and optimized CPU implementations will live in NNlib
## NNlib.jl
using LoopVectorization
import NNcore
NNcore.linear_bias_act(W, x, b, act=identity) = act.(W * x .+ b)
NNcore.linear_bias_act(W::Array, x::Array, b::Array, act=identity) = @avx act.(W * x .+ b)
## CUDA.jl
import NNcore
NNcore.linear_bias_act(W::CuArray, x::CuArray, b::CuArray, act=identity) = ... # cuDNN call ...
  • Not sure where the adjoints for NNcore should go. Currently, adjoints for NNlib are in Zygote. Maybe we should port them to ChainRulesCore and move them to NNlib (see the sketch below).
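On the adjoint question in the last bullet: a minimal sketch, assuming ChainRulesCore, of what a rule for the proposed linear_bias_act could look like, restricted to act = identity so the pullback stays short. This is not an existing NNlib or Zygote rule.

## Sketch: ChainRulesCore rule for the proposed linear_bias_act (identity activation only).
using ChainRulesCore

linear_bias_act(W, x, b, act = identity) = act.(W * x .+ b)   # generic definition from Proposal 1

function ChainRulesCore.rrule(::typeof(linear_bias_act),
                              W::AbstractMatrix, x::AbstractVecOrMat,
                              b::AbstractVector, act::typeof(identity))
    y = W * x .+ b
    function linear_bias_act_pullback(ȳ)
        Δ = unthunk(ȳ)
        ∂W = Δ * x'
        ∂x = W' * Δ
        ∂b = x isa AbstractVector ? Δ : vec(sum(Δ; dims = 2))   # sum over the batch, if any
        return NoTangent(), ∂W, ∂x, ∂b, NoTangent()
    end
    return y, linear_bias_act_pullback
end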

Proposal 2

Not sure; maybe we could let Flux handle some of the logic, e.g. choose between the optimized and non-optimized CPU implementations, or provide some of the generic implementations and some CPU optimizations. But now I believe this would be a less clean solution compared to Proposal 1.

Proposal 3

Like Proposal 1, but we move the generic implementations and the CPU-optimized ones to Flux itself, leaving just the bare method definitions in NNcore, a lean dependency for CUDA.jl. This has the advantage of not having to manage yet another repo and the related complications for adding new functionality. Maybe @mcabbott won't be a huge fan of this, but to me Flux doesn't seem to be that big, and there is also some stuff we can excise in the near future, like datasets and data handling utilities.

The disadvantage of this is that some libraries like Knet (cc @denizyuret) rely on NNlib's implementations, and switching their dependency to Flux could be a showstopper for them.

Proposal 4

Following @mcabbott's suggestion, we keep things essentially as they are, except that we wrap LoopVectorization and similar non-essential dependencies inside a @require. The disadvantage of this is that we lose versioning for these dependencies and we make the user experience less smooth, since users will have to know that they should load LoopVectorization for performance.

@DhairyaLGandhi @MikeInnes @mcabbott @chriselrod @ChrisRackauckas @maleadt @oxinabox @garthang

@DhairyaLGandhi
Member

Since the vectorisation is only relevant for Flux, it is much easier to have a repo that extends NNlib with the vectorisation support and to let Flux depend on that. CUDA load times would be unaffected, and we can move the NNPACK package to this new repo as well. Ideally, no code in Flux would change this way.

@CarloLucibello
Member

CarloLucibello commented Aug 3, 2020

If we want to be a bit future-proof and deal with optimizations that are not just vectorization, as in the case of conv_bias_act in JuliaGPU/CUDA.jl#321, we will have to change stuff in Flux in any case. Also, I don't like redirecting broadcast to a map for Base functions like tanh: that is type piracy, and we can avoid it with something like Proposal 1. Another point is that we may want to keep the generic and CPU-optimized implementations together; it seems less confusing and easier to maintain.
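For concreteness, the kind of definition being objected to looks roughly like the sketch below. It is shown only to illustrate why it counts as type piracy: the package defining it owns none of Base.Broadcast.broadcasted, tanh, or Array, so the method changes behaviour for every package that broadcasts tanh over a Float32 array. Not a recommendation.

## Illustration only, do not copy: the broadcast-redirect pattern discussed above.
using LoopVectorization: vmap
import Base.Broadcast: broadcasted

# Redirects tanh.(x::Array{Float32}) to an eager vectorized map.
broadcasted(::typeof(tanh), x::Array{Float32}) = vmap(tanh, x)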

Another option is to move most of the generic implementations and the CPU-optimized ones to Flux itself. This has the advantage of not having to manage yet another repo. I added this as Proposal 3 in the previous post.

@mcabbott
Member

mcabbott commented Aug 3, 2020

Loading Flux means loading Zygote, and there's a lot of compilation. I just timed it: 207 seconds the first time (it failed to create a usable precompiled cache, because of Zygote?), 71 seconds the second, 23 seconds the third. Why have NNlib at all if not to provide a library of functions which, while written for Flux, are useful elsewhere too? That's my argument against 3.

I was going to make a proposal 5, but I think I vote for 1 now. It's one more thing to manage, but a much simpler thing than having a FastNNlibCPU.jl; it's just like AbstractFFTs. Only package authors have to know or care about NNCore.

Both Zygote and CUDA need only depend on NNCore.

I guess NNlib could also contain gradient definitions (via ChainRulesCore) if/when these decide to depart from Zygote. NNCore just contains function ∇softmax end etc. for everybody to extend.

@denizyuret

I vote for 1 as well with a small change: NNcore should provide more than:

function linear_bias_act end

This tells us nothing about the API. I suggest a generic unoptimized implementation that at least specifies what is being computed and the API (parameters, keyword args, return value), i.e. I would put this generic Julia function in NNcore:

linear_bias_act(W, x, b, act=identity) = act.(W * x .+ b)

CPU-optimized implementations can then go into NNlib.jl, GPU-optimized ones into CUDA.jl, etc.

The gradient definitions should also go into NNcore.

@DhairyaLGandhi
Member

I don't think having to change code in Flux in this manner is the right way to go yet (some would obviously change, but this is a bit egregious), since that directly goes against being flexible and letting dispatch handle picking out the right code path. The optimisations are just an implementation detail which should not need to be exposed or fiddled with to write compliant code. I am in general in favour of having stubs to represent different hooks, but it is not clear at this time how much of this interacts with each other and how it is ultimately handled by the user.

ArrayInterface.jl is already going to support some of the things that caused this issue in the first place, and while we do need an abstract version of NNlib, if the actual code in there is lightweight and can handle the generic / basic definitions, which it did prior to the introduction of the vectorisation dependency, we are good. The argument is to bring up a different package that applies the optimisations using the same or an extended API, which is what would replace the NNlib dependency in Flux.

@maxfreu
Contributor

maxfreu commented Feb 8, 2021

Giving an example of Tim's suggestion: NNcore and NNlib could go in the same repo, just like SnoopCompile and SnoopCompileCore.

Another point to consider: maybe it makes sense not to put all GPU implementations into CUDA.jl, but into a separate repo, which can be vendor-agnostic to welcome AMD with open arms some time in the future.

@CarloLucibello
Member

Giving an example of Tim's suggestion: NNcore and NNlib could go in the same repo, just like SnoopCompile and SnoopCompileCore.

Yes, we should do this; a subpackage seems the most manageable option. A couple of links that may help us in doing it:
JuliaLang/Pkg.jl#1251 (comment)
JuliaRegistries/Registrator.jl#287

@DhairyaLGandhi
Member

Chris Elrod is already looking into the API we have, and we would consider a split then.

@CarloLucibello
Member

Closing this, as we have NNlibCUDA now.
