Split off the CPU implementation #224
Comments
I think we can ditch NNPACK_jll (or move it under a …).
With multi-package repos (JuliaLang/Pkg.jl#1251) that should be easier than it was. Not sure how ready to use that is, though.
Agree that having NNlib stay small seems like a good idea, but I think having simple fallback / reference implementations here (just broadcasting, etc.) seems worthwhile. Having fast CPU implementations of things like …
This discussion has been split across many different places, e.g. Discourse and FluxML/Flux.jl#1272; let's consolidate it here. I'll lay down a first proposal for the restructuring of the FluxML org.

Proposal 1
## NNcore.jl
function linear_bias_act end
## Flux.jl
using NNcore
(m::Dense)(x) = linear_bias_act(m.W, x, m.b, m.act)
## NNlib.jl
using LoopVectorization
import NNcore
NNcore.linear_bias_act(W, x, b, act=identity) = act.(W * x .+ b)
NNcore.linear_bias_act(W::Array, x::Array, b::Array, act=identity) = @avx act.(W * x .+ b)
## CUDA.jl
import NNcore
NNcore.linear_bias_act(W::CuArray, x::CuArray, b::CuArray, act=identity) = .. cudnn call ...
Proposal 2
Not sure; maybe we could let Flux handle some of the logic, e.g. choose between the optimized and the non-optimized CPU implementation, or provide some of the generic implementations and some CPU optimizations, but now I believe this would be a less clean solution than Proposal 1.

Proposal 3
Like Proposal 1, but we move the generic implementations and the CPU-optimized ones to Flux itself, leaving just the bare method definitions in NNcore, a lean dependency for CUDA.jl. This has the advantage of not having to manage yet another repo and the related complications for adding new functionality. Maybe @mcabbott won't be a huge fan of this, but to me Flux doesn't seem to be that big, and there is also some stuff that we can excise in the near future, like Datasets and data handling utilities. The disadvantage is that some libraries like Knet (cc @denizyuret) rely on NNlib's implementations, and switching their dependency to Flux could be a showstopper for them.

Proposal 4
Following @mcabbott's suggestion, we keep things essentially as they are, except that we wrap LoopVectorization and similar non-essential dependencies inside a …

@DhairyaLGandhi @MikeInnes @mcabbott @chriselrod @ChrisRackauckas @maleadt @oxinabox @garthang
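For concreteness, here is a rough sketch of what Proposal 4 could look like if the truncated sentence above refers to something like a Requires.jl @require block. That reading, the module name, and the exact glue code below are my assumptions, not part of the proposal:

```julia
# Hypothetical sketch only: keep a plain generic fallback in NNlib and define the
# LoopVectorization-accelerated method only if LoopVectorization is loaded by the user.
module NNlibSketch

using Requires

# Generic fallback: always available, no heavy dependencies.
linear_bias_act(W, x, b, act=identity) = act.(W * x .+ b)

function __init__()
    # The UUID must match the one in LoopVectorization's Project.toml.
    @require LoopVectorization="bdcacae8-1622-11e9-2a5c-532679323890" begin
        using .LoopVectorization: @avx
        # Faster CPU path, only defined when LoopVectorization is present.
        linear_bias_act(W::Array, x::Array, b::Array, act=identity) = @avx act.(W * x .+ b)
    end
end

end # module
```

With this pattern, CUDA.jl and other downstream packages would never pay the LoopVectorization load cost, at the price of the semver-tracking concerns raised in the original issue.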
Since the vectorisation is only relevant for Flux, it is much easier to have a repo that extends NNlib with the vectorisation support and let Flux depend on that. CUDA load times would be unaffected, and we can move the NNPACK package to this new repo as well. Ideally, no code in Flux would change this way.
If we want to be a bit future-proof and deal with optimizations that are not just vectorization, as in the case of …

Another option is to move most generic implementations and CPU-optimized ones to Flux itself. This has the advantage of not having to manage yet another repo. I added this as Proposal 3 in the previous post.
Loading Flux means loading Zygote, and there's a lot of compilation. I just timed it: 207 seconds the first time (it failed to create a usable precompiled cache, because of Zygote?), 71 seconds the second, 23 seconds the third time. Why have NNlib at all if not to provide a library of functions which, while written for Flux, are useful elsewhere too? That's my argument against 3.

I was going to make a Proposal 5, but I think I vote for 1 now. It's one more thing to manage, but a much simpler thing than having a FastNNlibCPU.jl; it's just like AbstractFFTs. Only package authors have to know or care about NNCore. Both Zygote and CUDA need only depend on NNCore. I guess NNlib could also contain gradient definitions (via ChainRulesCore) if/when these decide to depart from Zygote. NNCore just contains …
I vote for 1 as well, with a small change: NNcore should provide more than the bare stub

function linear_bias_act end

This tells us nothing about the API. I suggest a generic unoptimized implementation that will at least specify what is being computed and the API (parameters, keyword args, return value), i.e. I would put this generic Julia function in NNcore: …
CPU-optimized implementations can then go into NNlib.jl, GPU-optimized ones into CUDA.jl, etc. The gradient definitions should also go into NNcore.
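As a concrete illustration (the actual snippets from this comment were lost in this transcript), a generic reference implementation together with its gradient definition in NNcore might look roughly like the following, assuming the linear_bias_act API from Proposal 1 and a recent ChainRulesCore; the pullback is shown only for the identity activation to keep the sketch short:

```julia
using ChainRulesCore

# Reference (unoptimized) implementation: fixes the semantics and the API.
linear_bias_act(W, x, b, act=identity) = act.(W * x .+ b)

# Gradient definition for the act === identity case; W is a weight matrix, x a matrix
# of column samples, b a bias vector broadcast over columns.
function ChainRulesCore.rrule(::typeof(linear_bias_act), W, x, b, ::typeof(identity))
    y = W * x .+ b
    function linear_bias_act_pullback(ȳ)
        ȳ = unthunk(ȳ)
        # Tangents for: the function itself, W, x, b, act.
        return NoTangent(), ȳ * x', W' * ȳ, vec(sum(ȳ; dims=2)), NoTangent()
    end
    return y, linear_bias_act_pullback
end
```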
I don't think having to change code in Flux in this manner is the right way to go yet (some would obviously change, but this is a bit egregious), since that directly goes against staying flexible and letting dispatch pick out the right code path. The optimisations are just an implementation detail which should not need to be exposed or fiddled with to write compliant code. I am in general in favour of having stubs to represent different hooks, but it is not clear at this time how much of this interacts with each other and how it is ultimately handled by the user. ArrayInterface.jl is already going to support some of the things that cause this issue in the first place, and while we do need an abstract version of NNlib, if the actual code in there is lightweight and can handle the generic/basic definitions, which it did prior to the introduction of the vectorisation dependency, we are good. The argument is to bring up a different package that applies the optimisations using the same or an extended API, which is what would replace the NNlib dependency in Flux.
To give an example for Tim's suggestion: NNcore and NNlib could go in the same repo, just like SnoopCompile and SnoopCompileCore. Another point to consider: maybe it makes sense not to put all GPU implementations into CUDA.jl, but into a separate repo, which can be vendor-agnostic to welcome AMD with open arms some time in the future.
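For reference, a subdirectory-package setup in the SnoopCompile / SnoopCompileCore style could look roughly like the following; the names and paths are purely illustrative, not a decided layout:

```julia
# Hypothetical repo layout (illustrative only):
#
#   NNlib.jl/
#   ├── Project.toml          # NNlib itself; lists NNcore among its dependencies
#   ├── src/NNlib.jl
#   └── lib/NNcore/
#       ├── Project.toml      # the lean interface package, registered separately
#       └── src/NNcore.jl
#
# During development the inner package can be used via a path dependency,
# run from the repo root before loading NNlib:
using Pkg
Pkg.develop(path="lib/NNcore")
```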
Yes, we should do this; a subpackage seems the most manageable thing. A couple of links that may help us in doing it: …
Chris Elrod is already looking into the API we have, and we would consider a split then.
Closing this as we have NNlibCUDA now |
Currently NNlib.jl provides both the abstract interface and a CPU implementation of its functions, which is becoming a problem now that NNlib.jl depends on LoopVectorization.jl. I think the package may need to be split into an abstract interface and a CPU implementation, e.g. like AbstractFFTs/FFTW.jl. Such a split need not hurt usability, as the FFTW example shows.
Case in point: installing CUDA.jl (which implements the NNlib.jl interfaces for use with CuArrays) requires the following additional dependencies when integrating with NNlib: CpuId, SIMDPirates, DocStringExtensions, OffsetArrays, SLEEFPirates, LoopVectorization, VectorizationBase, NNPACK_jll, UnPack. The JLL is annoying, but okay. The fact that there are so many packages, however, causes the time to precompile CUDA.jl to increase by a whopping 50%, from 20s to 30s, as measured with hyperfine.

I'm not familiar enough with the ML stack / NNlib.jl to figure out how exactly that would look, but I do think we can improve things here. I'd rather not @requires the NNlib integration in CUDA.jl and lose semver tracking etc.