
Split off the CPU implementation #224

Closed

maleadt opened this issue Jul 15, 2020 · 13 comments

@maleadt
Contributor

maleadt commented Jul 15, 2020

Currently NNlib.jl provides both the abstract interface and a CPU implementation of its functions, which is becoming a problem now that NNlib.jl depends on LoopVectorization.jl. I think the package may need to be split into an abstract interface and a CPU implementation, e.g., like AbstractFFTs.jl/FFTW.jl. Such a split need not hurt usability, as the FFTW example shows.

Case in point: installing CUDA.jl (which implements the NNlib.jl interfaces for use with CuArrays) requires the following additional dependencies when integrating with NNlib: CpuId, SIMDPirates, DocStringExtensions, OffsetArrays, SLEEFPirates, LoopVectorization, VectorizationBase, NNPACK_jll, UnPack. The JLL is annoying, but okay. The fact that there are so many packages, however, causes the time to precompile CUDA.jl to increase by a whopping 50%, from 20s to 30s, as measured with hyperfine:

hyperfine 'julia -e "Base.compilecache(Base.PkgId(Base.UUID(\"052768ef-5323-5732-b1bb-66c8b64840ba\"), \"CUDA\"))"'

I'm not familiar enough with the ML stack / NNlib.jl to figure out exactly what that would look like, but I do think we can improve things here. I'd rather not @require the NNlib integration in CUDA.jl and lose semver tracking etc.
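For readers unfamiliar with the AbstractFFTs.jl/FFTW.jl pattern, a minimal sketch of such a split is shown below. The module names (NNinterface, NNimplCPU) and the softmax example are purely illustrative stand-ins, not the API proposed in this issue; the point is that the interface package carries only function stubs, and backends attach methods to them by dispatch.

## Hypothetical interface package: declares stubs only, no implementations.
module NNinterface
function softmax end
export softmax
end

## Hypothetical CPU backend: depends only on the interface package.
module NNimplCPU
import ..NNinterface: softmax   # import so methods can be added

# Reference implementation for plain CPU arrays.
function softmax(x::AbstractArray; dims = 1)
    y = exp.(x .- maximum(x; dims = dims))
    return y ./ sum(y; dims = dims)
end
end

using .NNinterface
softmax(rand(3, 4))   # dispatches to the method added by NNimplCPU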

@CarloLucibello
Member

I think we can ditch NNPACK_jll (or move it under a @require) since it is unreliable, and we can move the LoopVectorization stuff into Flux (see the discussion in FluxML/Flux.jl#1272). I'd prefer not to create another repo, to avoid the maintenance burden.

@maleadt
Contributor Author

maleadt commented Jul 16, 2020

I'd prefer not to create another repo to avoid the maintenance burden

With multi-package repos (JuliaLang/Pkg.jl#1251), that should be easier than it was. Not sure how ready to use that is, though.

@mcabbott
Member

Agree that having NNlib stay small seems like a good idea, but I think having simple fallback / reference implementations here (just broadcasting, etc) seems worthwhile.

Having fast CPU implementations of things like softmax live somewhere that isn't Flux would be great. Flux is huge & they are more widely useful. Can they just live here under @require LoopVectorization?
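A rough sketch of what that @require route could look like, using Requires.jl. The module name, file name, and UUID placeholder below are assumptions for illustration, not NNlib's actual layout; as Proposal 4 later in this thread notes, the trade-off is that the optional dependency loses compat/version tracking and users must know to load LoopVectorization themselves.

## Sketch only: conditional fast path via Requires.jl (hypothetical layout).
module NNlibSketch

using Requires

# Generic fallback, always available.
softmax(x; dims = 1) = (y = exp.(x .- maximum(x; dims = dims)); y ./ sum(y; dims = dims))

function __init__()
    # This block only runs if the user also loads LoopVectorization.
    # The UUID is a placeholder; the real one comes from LoopVectorization's Project.toml.
    @require LoopVectorization="00000000-0000-0000-0000-000000000000" begin
        # @avx-accelerated methods for dense CPU arrays would be defined here,
        # e.g. by including a file that overloads softmax(::Array).
        include("loopvectorization_impls.jl")
    end
end

end # module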

@CarloLucibello
Member

CarloLucibello commented Aug 1, 2020

This discussion has been split across many different places, e.g. Discourse and FluxML/Flux.jl#1272; let's consolidate it here.

I'll lay out a first proposal for restructuring the FluxML org.

Proposal 1

  • We create a new repo as suggested in the OP, let's call it NNcore.jl, with just the function definitions and no implementations, for backend vendors to extend.
## NNcore.jl
function linear_bias_act end
  • Flux will provide layers as a tiny interface to NNcore's primitives
## Flux.jl
using NNcore
(m::Dense)(x) = linear_bias_act(m.W, x, m.b, m.act)
  • Generic and optimized CPU implementations will live in NNlib
## NNlib.jl
using LoopVectorization
import NNcore
NNcore.linear_bias_act(W, x, b, act=identity) = act.(W * x .+ b)
NNcore.linear_bias_act(W::Array, x::Array, b::Array, act=identity) = @avx act.(W * x .+ b)
## CUDA.jl
import NNcore
NNcore.linear_bias_act(W::CuArray, x::CuArray, b::CuArray, act=identity) = ... # cuDNN call ...
  • Not sure where the adjoints for NNcore should go. Currently, adjoints for NNlib are in Zygote. Maybe we should port them to ChainRulesCore and move them to NNlib (see the sketch below).
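On the adjoint question in the last bullet: a minimal sketch, assuming ChainRulesCore, of what a rule for the proposed linear_bias_act could look like, restricted to act = identity so the pullback stays short. This is not an existing NNlib or Zygote rule.

## Sketch: ChainRulesCore rule for the proposed linear_bias_act (identity activation only).
using ChainRulesCore

linear_bias_act(W, x, b, act = identity) = act.(W * x .+ b)   # generic definition from Proposal 1

function ChainRulesCore.rrule(::typeof(linear_bias_act),
                              W::AbstractMatrix, x::AbstractVecOrMat,
                              b::AbstractVector, act::typeof(identity))
    y = W * x .+ b
    function linear_bias_act_pullback(ȳ)
        Δ = unthunk(ȳ)
        ∂W = Δ * x'
        ∂x = W' * Δ
        ∂b = x isa AbstractVector ? Δ : vec(sum(Δ; dims = 2))   # sum over the batch, if any
        return NoTangent(), ∂W, ∂x, ∂b, NoTangent()
    end
    return y, linear_bias_act_pullback
end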

Proposal 2

Not sure; maybe we could let Flux handle some of the logic, e.g. choose between the optimized and non-optimized CPU implementations, or provide some of the generic implementations and some CPU optimizations. But now I believe this would be a less clean solution compared to Proposal 1.

Proposal 3

Like Proposal 1, but we move the generic implementations and the CPU-optimized ones to Flux itself, leaving just the bare method definitions in NNcore, a lean dependency for CUDA.jl. This has the advantage of not having to manage yet another repo and the related complications for adding new functionality. Maybe @mcabbott won't be a huge fan of this, but to me Flux doesn't seem to be that big, and there is also some stuff we can excise in the near future, like datasets and data handling utilities.

The disadvantage of this is that some libraries like Knet (cc @denizyuret) rely on NNlib's implementations, and switching their dependency to Flux could be a showstopper for them.

Proposal 4

Following @mcabbott's suggestion, we keep things essentially as they are, except that we wrap LoopVectorization and similar non-essential dependencies inside a @require. The disadvantage of this is that we lose versioning for these dependencies and we make the user experience less smooth, since users will have to know that they should load LoopVectorization for performance.

@DhairyaLGandhi @MikeInnes @mcabbott @chriselrod @ChrisRackauckas @maleadt @oxinabox @garthang

@DhairyaLGandhi
Member

Since the vectorisation is only relevant for Flux, it is much easier to have a repo that extends NNlib with the vectorisation support and to let Flux depend on that. CUDA load times would be unaffected, and we can move the NNPACK package to this new repo as well. Ideally, no code in Flux would change this way.

@CarloLucibello
Member

CarloLucibello commented Aug 3, 2020

If we want to be a bit future-proof and deal with optimizations that are not just vectorization, as in the case of conv_bias_act in JuliaGPU/CUDA.jl#321, we will have to change stuff in Flux in any case. Also, I don't like redirecting broadcast to a map for Base functions like tanh: that is type piracy, and we can avoid it with something like Proposal 1. Another point is that we may want to keep the generic and CPU-optimized implementations together; it seems less confusing and easier to maintain.
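For concreteness, the kind of definition being objected to looks roughly like the sketch below. It is shown only to illustrate why it counts as type piracy: the package defining it owns none of Base.Broadcast.broadcasted, tanh, or Array, so the method changes behaviour for every package that broadcasts tanh over a Float32 array. Not a recommendation.

## Illustration only, do not copy: the broadcast-redirect pattern discussed above.
using LoopVectorization: vmap
import Base.Broadcast: broadcasted

# Redirects tanh.(x::Array{Float32}) to an eager vectorized map.
broadcasted(::typeof(tanh), x::Array{Float32}) = vmap(tanh, x)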

Another option is to move most of the generic implementations and the CPU-optimized ones to Flux itself. This has the advantage of not having to manage yet another repo. I added this as Proposal 3 in the previous post.

@mcabbott
Member

mcabbott commented Aug 3, 2020

Loading Flux means loading Zygote, and there's a lot of compilation. I just timed it: 207 seconds the first time (it failed to create a usable precompiled cache, because of Zygote?), 71 seconds the second, 23 seconds the third. Why have NNlib at all if not to provide a library of functions which, while written for Flux, are useful elsewhere too? That's my argument against 3.

I was going to make a proposal 5, but I think I vote for 1 now. It's one more thing to manage, but a much simpler thing than having a FastNNlibCPU.jl; it's just like AbstractFFTs. Only package authors have to know or care about NNCore.

Both Zygote and CUDA need only depend on NNCore.

I guess NNlib could also contain gradient definitions (via ChainRulesCore) if/when these decide to depart from Zygote. NNCore just contains function ∇softmax end etc. for everybody to extend.

@denizyuret

I vote for 1 as well with a small change: NNcore should provide more than:

function linear_bias_act end

This tells us nothing about the API. I suggest a generic unoptimized implementation that at least specifies what is being computed and the API (parameters, keyword args, return value), i.e. I would put this generic Julia function in NNcore:

linear_bias_act(W, x, b, act=identity) = act.(W * x .+ b)

CPU-optimized implementations can then go into NNlib.jl, GPU-optimized ones into CUDA.jl, etc.

The gradient definitions should also go into NNcore.

@DhairyaLGandhi
Member

I don't think having to change code in Flux in this manner is the right way to go yet (some would obviously change, but this is a bit egregious), since that directly goes against being flexible and letting dispatch handle picking out the right code path. The optimisations are just an implementation detail which should not need to be exposed or fiddled with to write compliant code. I am in general in favour of having stubs to represent different hooks, but it is not clear at this time how much of this interacts with each other and how it is ultimately handled by the user.

ArrayInterface.jl is already going to support some of the things that caused this issue in the first place, and while we do need an abstract version of NNlib, if the actual code in there is lightweight and can handle the generic / basic definitions, which it did prior to the introduction of the vectorisation dependency, we are good. The argument is to bring up a different package that applies the optimisations using the same or an extended API, which is what would replace the NNlib dependency in Flux.

@maxfreu
Contributor

maxfreu commented Feb 8, 2021

Giving an example of Tim's suggestion: NNcore and NNlib could go in the same repo, just like SnoopCompile and SnoopCompileCore.

Another point to consider: maybe it makes sense not to put all GPU implementations into CUDA.jl, but into a separate repo, which can be vendor-agnostic to welcome AMD with open arms some time in the future.

@CarloLucibello
Member

Giving an example of Tim's suggestion: NNcore and NNlib could go in the same repo, just like SnoopCompile and SnoopCompileCore.

Yes, we should do this; a subpackage seems the most manageable option. A couple of links that may help us in doing it:
JuliaLang/Pkg.jl#1251 (comment)
JuliaRegistries/Registrator.jl#287

@DhairyaLGandhi
Member

Chris Elrod is already looking into the API we have, and we would consider a split then.

@CarloLucibello
Member

Closing this, as we have NNlibCUDA now.
