This repository was archived by the owner on Mar 12, 2021. It is now read-only.

Memory management for libraries #130

Closed
maleadt opened this issue Sep 3, 2018 · 8 comments · Fixed by #436

Comments

@maleadt
Member

maleadt commented Sep 3, 2018

Libraries like cuFFT also perform allocations, and those can fail because of outstanding references (i.e., the reason we introduced the memory pool). See https://discourse.julialang.org/t/cuarray-and-optim/14053/7

Maybe we should try to generalize the pooling memory allocator for library allocations. On the other hand, these allocations are often non-reusable (e.g., cufftXtMalloc is plan-specific), so maybe we should just lift out the malloc/gc/reclaim logic.

Specific to FFT, we should probably split cufftPlan1d into manual plan creation + allocation to make this possible.

@RainerHeintzmann

Just to make sure there is no misunderstanding: I received the memory (FFT plan) allocation errors even though I was only ever using a single 256x256 2D array. One would naively think that only a single forward and a single backward plan would need to be allocated, yet I almost always get memory allocation errors.

@maleadt
Member Author

maleadt commented Sep 13, 2018

I received the memory (FFT-PLAN) allocation errors even though I was only ever using a single 2D array 256x256

But you did have some arrays allocated, right? The GPU shares its memory between cuFFT plans and other allocations, and we keep a pool of regular allocations around that could interfere with new cuFFT plan creation. You can inspect the size of those pools like so:

sum(getfield.(collect(Base.Iterators.flatten([CuArrays.pools_used..., CuArrays.pools_avail...])), :bytesize))

If that doesn't show a large number, please provide an MWE and I'll have a closer look.
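For convenience, that expression can be wrapped in a small helper (a sketch against the same CuArrays internals; pools_used and pools_avail are not a public API, so this may break across versions):

# Hypothetical helper around the expression above.
pooled_bytes() = sum(getfield.(
    collect(Base.Iterators.flatten([CuArrays.pools_used..., CuArrays.pools_avail...])),
    :bytesize))

pooled_bytes()  # call before and after the FFT loop to watch pooled memory grow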

@RainerHeintzmann

The following code (starting Julia from scratch):

using CuArrays, FFTW
CuArrays.allowscalar(false);
A = cu(randn(256,256));
for n in 1:1000
    global A
    A = ifft(fft(A))
end

yields:

ERROR: CUFFTError(code 1, cuFFT was passed an invalid plan handle)
Stacktrace:
 [1] _mkplan(::UInt8, ::Tuple{Int64,Int64}, ::UnitRange{Int64}) at C:\Users\pi96doc\.julia\packages\CuArrays\F96Gk\src\fft\CUFFT.jl:112
 [2] plan_bfft at C:\Users\pi96doc\.julia\packages\CuArrays\F96Gk\src\fft\CUFFT.jl:394 [inlined]
 [3] #plan_ifft#15 at C:\Users\pi96doc\.julia\packages\AbstractFFTs\7WCaR\src\definitions.jl:268 [inlined]
 [4] plan_ifft at C:\Users\pi96doc\.julia\packages\AbstractFFTs\7WCaR\src\definitions.jl:268 [inlined]
 [5] #plan_ifft#3 at C:\Users\pi96doc\.julia\packages\AbstractFFTs\7WCaR\src\definitions.jl:52 [inlined]
 [6] plan_ifft at C:\Users\pi96doc\.julia\packages\AbstractFFTs\7WCaR\src\definitions.jl:52 [inlined]
 [7] ifft(::CuArray{Complex{Float32},2}) at C:\Users\pi96doc\.julia\packages\AbstractFFTs\7WCaR\src\definitions.jl:50
 [8] top-level scope at .\none:5

on a Windows 10 system with CUDA 9.0 and CUDA 9.1 installed. With only 100 iterations it runs fine.
If I run your line (the one starting with sum) right after startup I get an error, even after the using commands. However, if I run it after the first allocation (A = cu(randn(256,256))) I obtain
262144
after 100 iterations I get:
106168320
after the unsuccessful attempt at 1000 iterations, I get:
53818163

@maleadt
Member Author

maleadt commented Sep 18, 2018

Had a quick look, and the cause is the workspace memory that is left hanging around after multiple cufftPlan1d calls. Quick fix: call GC.gc().
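Applied to the loop from the earlier comment, the quick fix looks roughly like this (a sketch; the GC interval of 100 iterations is an arbitrary choice):

using CuArrays, FFTW

A = cu(randn(256, 256));
for n in 1:1000
    global A
    A = ifft(fft(A))
    # Periodically force a GC pass so stale plans (and their workspace)
    # get finalized before the device runs out of memory.
    n % 100 == 0 && GC.gc()
end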

A better fix (see https://docs.nvidia.com/cuda/cufft/index.html#multiple-GPU-cufft-transforms) would be to do our own workspace allocations as regular CuArrays, which are backed by the memory pool.

@RainerHeintzmann

I am not sure I understand what you are suggesting. Wouldn't it make a lot of sense if existing plans were reused rather than always destroyed? I think plan creation is actually relatively slow compared to FFTs of smaller arrays. How can I do my own "workspace allocation"? I guess I would need to learn how to call the core CUDA routines from within Julia and then copy the Julia-allocated arrays into such plain CUDA arrays? Any help is appreciated.

@wsphillips

wsphillips commented Jan 16, 2019

Allocation problems with cuFFT plans are addressed by storing the plan in a variable and later calling destroy_plan(plan) after use; destroy_plan is available in CuArrays.CUFFT but not documented. This works even though GC.gc() doesn't.

Each FFT call allocates the workspace memory needed to complete the computation. If you don't keep a handle to the plan, it's invisible to Julia and just piles up on the device side. Note that you can reuse a plan for the same-size FFT/batch repeatedly (see the sketch at the end of this comment).

See here for an example similar to yours, as well as other syntax for tighter control of memory when running lots of computations like this.

Edit: cufftGetSize(), cufftSetAutoAllocation(), and cufftSetWorkArea() may be useful too.
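To make the pattern above concrete, here is a minimal sketch of reusing a single plan and destroying it explicitly (using the undocumented CuArrays.CUFFT.destroy_plan mentioned above):

using CuArrays, FFTW

a = CuArray(ones(Complex{Float64}, 256, 256))
# Create the plan once and reuse it for every transform of this size/batch.
p = plan_fft!(a)
for i in 1:1000
    p * a
end
# Release the plan (and its workspace) explicitly instead of waiting for the GC.
CuArrays.CUFFT.destroy_plan(p)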

@leios

leios commented Jul 31, 2019

I did a bit of poking around on this issue, specifically the cuFFT implementation. I was unable to confirm that repeated fft calls accrue a large amount of memory due to the cuFFT plans; however, I may not have tested this correctly. What I did confirm is that the only way to prevent the error was to create an in-place plan and apply it as plan*a.

Here were the tests I ran:

using CuArrays, CUDAnative, FFTW

# test to confirm the issue with GPU memory for cufft library
function cufft_test(n)
    a = CuArray(convert.(Complex{Float64}, ones(100,100)))
    for i = 1:n
        # Both fft() and fft!() have this issue
        fft!(a)
    end
end

# this function does not allocate any more memory than necessary
function cufft_plan_test(n)
    a = CuArray(convert.(Complex{Float64}, ones(100,100)))
    plan = plan_fft!(a)
    for i = 1:n
        plan*a
    end
end

# an attempt at seeing whether Julia allocates too much memory for the plans
# themselves. Note: cuFFT plans are ints, so this should be fair?
function cufft_plan_mem_test(n)
    a = CuArray(convert.(Complex{Float64}, ones(1)))
    for i = 1:n
        a = fft(a)
    end
end

Here is the AbstractFFTs code that generates the fft!() function:

to1(x::AbstractArray) = _to1(axes(x), x)
_to1(::Tuple{Base.OneTo,Vararg{Base.OneTo}}, x) = x
_to1(::Tuple, x) = copy1(eltype(x), x)

# implementations only need to provide plan_X(x, region)
# for X in (:fft, :bfft, ...):
for f in (:fft, :bfft, :ifft, :fft!, :bfft!, :ifft!, :rfft)
    pf = Symbol("plan_", f)
    @eval begin
        $f(x::AbstractArray) = (y = to1(x); $pf(y) * y)
        $f(x::AbstractArray, region) = (y = to1(x); $pf(y, region) * y)
        $pf(x::AbstractArray; kws...) = (y = to1(x); $pf(y, 1:ndims(y); kws...))
    end
end

The only difference is the copy(), so it seems like we are just reserving memory for the FFT that is never freed in the end. Is this correct? EDIT: incorrect upon further reflection. The memory usage seems to come from the plan generation and is correlated with the issue, but is not the cause of it.

If so, we could probably find a solution by reserving memory equal to the size of the arrays we will be fft'ing and using that instead of copying every time... I'll keep poking around a bit more.

@benchislett

I have done some digging into this problem, specifically cuFFT error code 1, and I think I've found the root of the problem.

To begin, I tested the same fft operation in CUDA/C. Here's the code:

#include <cuda.h>
#include <cufft.h>
#include <stdio.h>

long N = 1*1;

void checkCufft(int res, const char *msg, int iter) {
  if (res != CUFFT_SUCCESS) {
    printf("CUFFT Error, Code %d, Message: %s, Iteration: %d\n", res, msg, iter);
    exit(res);
  }
}

int main() {

  cufftHandle plan;
  cufftComplex *data;

  cudaMalloc((void**)&data, N * sizeof(cufftComplex));

  for (int i = 0; i < 1025; i++) {
    checkCufft(cufftPlan1d(&plan, N, CUFFT_C2C, 1), "Plan", i);
    checkCufft(cufftExecC2C(plan, data, data, CUFFT_FORWARD), "Exec", i);
  }
  cudaDeviceSynchronize();

  cufftDestroy(plan);
  cudaFree(data);

  printf("Program successful!\n");
}

When running this, you will consistently find that plan creation fails with error code 1 exactly on the 1024th iteration. This failure is completely independent of N: regardless of the size, the error occurs once 1024 plans exist at the same time. When the cufftDestroy(plan); call is moved into the loop, the execution works perfectly (unless N is too large and you run out of memory).

Because of this, I believe there is an implicit limit in cuFFT on the number of plans you can have at once, and this is what causes the error. This is clearly not intentional, since the limit is not mentioned in the documentation, and cufftPlan1d isn't even supposed to return that error code at all.

The reason Julia behaves differently is that garbage collection does not release these plans quickly enough. Let's look closely at the implementation of the FFTs:

(Side note: the 1D example given by @leios works because AbstractFFTs converts 1D arrays to CPU arrays, so a 1D CuArray does not hit this problem.)

The key line is here:
$f(x::AbstractArray) = (y = to1(x); $pf(y) * y)
I'll refactor for readability:

function fft(x)
    plan_fft(x) * x
end

And as mentioned by @wsphillips, "Allocation problems with cuFFT plans are addressed by storing the plan to a variable and later calling destroy_plan(plan)"
However, this destruction clearly does not happen in time. Rather than being called explicitly, destroy_plan is only registered as a finalizer in CuArrays.jl/src/fft/fft.jl:

mutable struct cCuFFTPlan{...}
    ...

    function cCuFFTPlan{...}(...)
        p = new(...)
        finalizer(destroy_plan, p)
        p
    end
end

The finalizer destroy_plan is supposed to run when p is garbage collected, and yet we get the error. This is either because destroy_plan is slow to release the plan, or because the garbage collector is slow to realize it should run the finalizer (or both). We can test this with some simple Julia code in the REPL:

using CuArrays, FFTW
a = CuArray(ones(Complex{Float64}, 100, 100))

# Control: should fail
for i=1:10000
    fft(a)
end

# Should fail because of the plan limit mentioned above
for i=1:10000
    plan = plan_fft(a)
    # plan * a # fails with or without this line
end

# Will succeed if it is the GC's fault for not finalizing fast enough
for i=1:10000
    plan = plan_fft(a)
    plan * a
    finalize(plan) # Equivalently, CuArrays.CUFFT.destroy_plan(plan)
end

When run, only the last case passes. As such, we can see that the error is caused by the GC not keeping up with all the plans.

We could in theory remove the fft definition from AbstractFFTs altogether and let each library define the FFT itself, allowing CuArrays to destroy the plan in its own FFT methods.
I think the best solution, though, is simply to call the finalizer on the plan in AbstractFFTs. This would only require a minor refactor to get the return values right, and if anything (in my opinion) that would be a readability improvement on its own.
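For illustration, here is a sketch of what that refactor might look like for the out-of-place case, expanded by hand rather than through the @eval loop shown earlier (to1 is the helper from that snippet; finalize is the same call used in the test above):

# Hypothetical hand-expanded version of the generated fft method, with an
# explicit finalize so the plan is released eagerly instead of by the GC.
function fft(x::AbstractArray)
    y = to1(x)
    p = plan_fft(y)
    result = p * y
    finalize(p)   # runs the registered finalizer (destroy_plan) immediately
    return result
end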

What do you guys think?
Could there be a better way of getting the GC to pick up on what's happening?
