Memory management for libraries #130
Just to make sure there is no misunderstanding: I received the memory (FFT plan) allocation errors even though I was only ever using a single 256x256 2D array. So one would naively think that only a single forward and a single backward plan would need to be allocated, yet I almost always get memory allocation errors.
But you did have some arrays allocated, right? The GPU shares its memory between cuFFT plans and other allocations, and we keep a pool of regular allocations around that could interfere with new cuFFT plan creations. You can inspect the size of those pools as such:
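A minimal way to get a rough picture, assuming CUDAdrv's Mem.info is available (this reports free vs. total device memory rather than the per-pool breakdown):
using CUDAdrv
free, total = CUDAdrv.Mem.info()   # bytes currently free / total on the device
println("free: $(free ÷ 2^20) MiB of $(total ÷ 2^20) MiB")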
The following code (starting Julia from scratch):
yields:
on a Windows 10 system with CUDA 9.0 and CUDA 9.1 installed. With only 100 iterations it runs fine.
Had a quick look, and the cause is the workspace memory that is left hanging around after multiple plan creations. Better fix (see https://docs.nvidia.com/cuda/cufft/index.html#multiple-GPU-cufft-transforms): do our own workspace allocations as regular CuArray buffers.
I am not sure I understand what you are suggesting. Wouldn't it make a lot of sense if the existing plans were not always destroyed but reused? I think plan creation is actually relatively slow compared to FFTs of smaller arrays. How can I do my own "workspace allocation"? I guess I would need to learn how to call the core CUDA routines from within Julia and then copy the Julia-allocated arrays to these standard CUDA arrays? Any help is appreciated.
Allocation problems with cuFFT plans are addressed by storing the plan in a variable and later destroying it explicitly. Each call to an FFT allocates workspace memory needed to complete the computation. If you don't keep a handle to the plan, it's invisible to Julia and just piles up on the device side. Note that you can reuse a plan for the same size FFT/batch repeatedly. See here for an example similar to yours, as well as other syntax for tighter control of memory when running lots of computations like this. Edit: cufftGetSize(), cufftSetAutoAllocation(), and cufftSetWorkArea() may be useful too.
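A minimal sketch of that pattern (nothing here beyond plan_fft!, applying the stored plan, and an explicit finalize):
using CuArrays, FFTW
a = CuArray(ones(Complex{Float64}, 256, 256))
p = plan_fft!(a)          # create the plan once
for i in 1:1000
    p * a                 # reuse the same plan (and its workspace) every iteration
end
finalize(p)               # destroy the plan explicitly instead of waiting for the GC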
I did a bit of poking around for this issue, specifically for the cuFFT implementations. I was unable to confirm that multiple fft calls accrue a large amount of memory due to the cuFFT plans; however, I may not have tested this correctly (a sketch for measuring this directly follows the tests below). What I did confirm was that the only way to prevent the error was by creating an in-place plan and applying that plan directly. Here are the tests I ran:
using CuArrays, CUDAnative, FFTW
# test to confirm the issue with GPU memory for cufft library
function cufft_test(n)
a = CuArray(convert.(Complex{Float64}, ones(100,100)))
for i = 1:n
# Both fft() and fft!() have this issue
fft!(a)
end
end
# this function does not allocate any more memory than necessary
function cufft_plan_test(n)
a = CuArray(convert.(Complex{Float64}, ones(100,100)))
plan = plan_fft!(a)
for i = 1:n
plan*a
end
end
# an attempt at seeing if Julia allocates too much memory for the plans
# themselves. Note: cuFFT plans are ints, so this should be fair?
function cufft_plan_mem_test(n)
a = CuArray(convert.(Complex{Float64}, ones(1)))
for i = 1:n
a = fft(a)
end
end
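One way to make that check concrete is a sketch like the following, which compares free device memory before and after a run (assuming CUDAdrv's Mem.info is available; measure_growth is just an illustrative name):
using CUDAdrv
# run f(n) and report how much device memory is no longer free afterwards
function measure_growth(f, n)
    free_before, _ = CUDAdrv.Mem.info()
    f(n)
    GC.gc()                            # give finalizers a chance to run
    free_after, _ = CUDAdrv.Mem.info()
    return free_before - free_after    # bytes still held after the run
end
# e.g. measure_growth(cufft_test, 500) vs. measure_growth(cufft_plan_test, 500)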
Here is the relevant code for fft in AbstractFFTs:
to1(x::AbstractArray) = _to1(axes(x), x)
_to1(::Tuple{Base.OneTo,Vararg{Base.OneTo}}, x) = x
_to1(::Tuple, x) = copy1(eltype(x), x)
# implementations only need to provide plan_X(x, region)
# for X in (:fft, :bfft, ...):
for f in (:fft, :bfft, :ifft, :fft!, :bfft!, :ifft!, :rfft)
pf = Symbol("plan_", f)
@eval begin
$f(x::AbstractArray) = (y = to1(x); $pf(y) * y)
$f(x::AbstractArray, region) = (y = to1(x); $pf(y, region) * y)
$pf(x::AbstractArray; kws...) = (y = to1(x); $pf(y, 1:ndims(y); kws...))
end
end
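For concreteness, a hand-expanded sketch of what that loop generates for f = :fft, which shows why every top-level fft call constructs a brand-new plan:
fft(x::AbstractArray) = (y = to1(x); plan_fft(y) * y)
fft(x::AbstractArray, region) = (y = to1(x); plan_fft(y, region) * y)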
If so, we could probably find a solution by reserving memory equal to the size of the arrays we will be transforming.
I have done some digging around this problem, specifically the cuFFT error #1, and I think I've found the root of the problem. To begin, I tested the same FFT operation in CUDA/C. Here's the code:
#include <cuda.h>
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h> // for exit()
long N = 1*1;
void checkCufft(int res, const char *msg, int iter) {
if (res != CUFFT_SUCCESS) {
printf("CUFFT Error, Code %d, Message: %s, Iteration: %d\n", res, msg, iter);
exit(res);
}
}
int main() {
cufftHandle plan;
cufftComplex *data;
cudaMalloc((void**)&data, N * sizeof(cufftComplex));
for (int i = 0; i < 1025; i++) {
checkCufft(cufftPlan1d(&plan, N, CUFFT_C2C, 1), "Plan", i);
checkCufft(cufftExecC2C(plan, data, data, CUFFT_FORWARD), "Exec", i);
}
cudaDeviceSynchronize();
cufftDestroy(plan);
cudaFree(data);
printf("Program successful!\n");
}
When running this, you will find that, consistently, the plan creation operation fails with error code 1 on exactly the 1024th iteration. This failure is also completely independent of N: regardless of the size, the error occurs when 1024 plans exist at once. Because of this, I believe that there is an implicit limit in cuFFT on the number of plans you can have at once, and this is causing the error. This is clearly not intentional, since this boundary is not mentioned in the documentation, and because cufftPlan1d isn't even supposed to return that error code at all.
The reason that Julia is different is that garbage collection is too slow to release these plans quickly enough. Let's look closely at the implementation of the FFTs. (Side note: the example given by @leios works because AbstractFFTs converts 1D arrays to CPU arrays, so a 1D CuArray does not have this problem.) The key line is here:
function fft(x)
plan_fft(x) * x
end
And as mentioned by @wsphillips, "Allocation problems with cuFFT plans are addressed by storing the plan in a variable and later destroying it explicitly." CuArrays already registers a finalizer for that on every plan:
mutable struct cCuFFTPlan{...}
...
function cCuFFTPlan{...}(...)
p = new(...)
finalizer(destroy_plan, p)
p
end
end
The finalizer destroys the plan, but only when the garbage collector actually collects it. To check whether that is the problem, I ran the following:
using CuArrays, FFTW
a = CuArray(ones(Complex{Float64}, 100, 100))
# Control: should fail
for i=1:10000
fft(a)
end
# Should fail because of the plan limit mentioned above
for i=1:10000
plan = plan_fft(a)
# plan * a # fails with or without this line
end
# Will succeed if it is the GC's fault for not finalizing fast enough
for i=1:10000
plan = plan_fft(a)
plan * a
finalize(plan) # Equivalently, CuArrays.CUFFT.destroy_plan(plan)
end
When run, the last case passes. As such, we can see that the error is caused by the GC not keeping up with all the plans. We could in theory remove the fft definition from AbstractFFTs altogether and let each library define the FFT itself, allowing CuArrays to destroy the plan in its own FFT methods. What do you guys think?
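A minimal sketch of that idea (eager_fft is just an illustrative name): a CuArrays-side method that creates the plan, applies it, and finalizes it immediately instead of leaving it to the GC:
using CuArrays, FFTW
function eager_fft(x::CuArray)
    p = plan_fft(x)
    try
        return p * x
    finally
        finalize(p)    # free the cuFFT plan (and its workspace) right away
    end
end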
Libraries like cuFFT also perform allocations, which might fail due to outstanding references (i.e., why we introduced the memory pool). See https://discourse.julialang.org/t/cuarray-and-optim/14053/7
Maybe we should try and generalize the pooling memory allocator for library allocations. On the other hand, these allocations are often non-reusable (e.g. cufftXtMalloc is plan-specific), so maybe we should just lift out the malloc/gc/reclaim logic. Specific to FFT, we should probably split cuPlan1D into manual plan creation + alloc to make this possible.
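A rough sketch of what lifting out the malloc/gc/reclaim logic could look like, with alloc_fn standing in for whatever library allocation is being wrapped (the name and retry count are illustrative only):
# try a library allocation; on failure run the GC so finalizers can release
# plans and buffers, then retry a few times before giving up
function alloc_with_reclaim(alloc_fn; retries = 3)
    for attempt in 1:retries
        try
            return alloc_fn()
        catch
            attempt == retries && rethrow()
            GC.gc()
        end
    end
end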
into manual plan creation + alloc to make this possible.The text was updated successfully, but these errors were encountered: