Torchvision decode_jpeg memory leak #4378
Comments
@fmassa Is general […]
@NicolasHug @fmassa Also having this issue. Tried loading images in a loop using […]
Same problem: […]
Thanks all for the reports. I took a look at this today. I can reproduce the leak; I do see the memory usage going up constantly.
I thought the leak might come from the fact that we don't free the nvjpeg handle (we literally leak it for convenience), see vision/torchvision/csrc/io/image/cuda/decode_jpeg_cuda.cpp lines 28 to 30 in 9ae0169, but that's not the case: putting the handle back within the function and properly destroying it with nvjpegDestroy doesn't help. I don't see the leak anymore when commenting out the decode call itself, but I don't know whether that's actually a bug in nvjpeg or if there's something else going on. Either way, I don't understand it yet. nvjpeg allows passing custom device memory allocators, so perhaps there is something to do there. Cheers
Update: this still leaks 🥲

```cpp
// Route nvjpeg's device allocations through PyTorch's CUDA caching allocator.
int dev_malloc(void **p, size_t s) {
  *p = c10::cuda::CUDACachingAllocator::raw_alloc(s);
  return 0;
}

int dev_free(void *p) {
  c10::cuda::CUDACachingAllocator::raw_delete(p);
  return 0;
}

// ...

// Create the nvjpeg handle with the custom device allocator.
nvjpegDevAllocator_t dev_allocator = {&dev_malloc, &dev_free};
nvjpegStatus_t status = nvjpegCreateEx(NVJPEG_BACKEND_DEFAULT, &dev_allocator,
                                       NULL, NVJPEG_FLAGS_DEFAULT, &nvjpeg_handle);
```
I had a chance to look at this more: this is an nvjpeg bug, and unfortunately I'm not sure we can do much about it. It was fixed with CUDA 11.6, but I'm still observing the leak with 11.0 - 11.5. A temporary fix for Linux users is to download the 11.6 nvjpeg.so (e.g. from here) and point torchvision at it instead of the bundled one.
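For reference, one quick way to double-check which libnvjpeg the process actually loads at runtime (a rough Linux-only sketch; "image.jpg" is just a placeholder path, and inspecting /proc/self/maps is only one way to do this):

```python
import torchvision

# Decode one image on the GPU so torchvision's image extension (and libnvjpeg) get loaded.
data = torchvision.io.read_file("image.jpg")  # placeholder path, any JPEG works
img = torchvision.io.decode_jpeg(data, device="cuda")

# List every libnvjpeg mapped into this process (Linux only).
with open("/proc/self/maps") as f:
    paths = sorted({line.split()[-1] for line in f if "nvjpeg" in line})
print("\n".join(paths))
```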
Hello @NicolasHug, thanks for the answer! I reinstalled CUDA, and now I have this version, but the problem does not disappear. Should I rebuild torch with CUDA 11.6 from source?
What does […]
@NicolasHug Mine is showing /site-packages/torchvision/../torchvision.libs/libnvjpeg.90286a3c.so.11. How do I fix this to use the system CUDA?
@rydenisbak Did you figure this out?
@Scass0807 If the path is coming from […]
@NicolasHug I added a symlink libnvjpeg.90286a3c.so.11 -> /usr/local/cuda-11.6/lib64/libnvjpeg.so.11. Now there is only one nvjpeg, but the memory leak persists. I wonder if this is because, even though I am using 11.6, my driver version is 495.23, which is technically for 11.5. I am using GCP Compute Engine and unfortunately they do not yet support 511.
Hi @NicolasHug, would you mind telling us where you got this information? I could not find it in the CUDA 11.6 release notes. Also, I cannot reproduce this memory leak with CUDA 10.2 (docker pull nvcr.io/nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04). It would be great to have some more information.
I basically tried all versions I could find from https://pkgs.org/search/?q=libnvjpeg-devel
@NicolasHug Should installing 11.6 and using the nvjpeg it ships with work? Do I have to install the RPM even though I'm on Ubuntu? Based on the ldd results above, I'm not sure there's anything else I can do.
It seems that there is a small multithreading confusion here: the nvjpeg_handle_creation_flag should be global, not local.
It does not work with the CUDA 11.7 libnvjpeg either, and the same behavior is observed when using numpy.frombuffer. Now I have to decode JPEGs on the CPU like a peasant :'(
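For anyone else falling back the same way, a rough sketch of that CPU workaround (assuming a local "image.jpg" placeholder; decoding on the CPU never touches nvjpeg):

```python
import torchvision

data = torchvision.io.read_file("image.jpg")   # placeholder path
img = torchvision.io.decode_jpeg(data)         # CPU decode, avoids nvjpeg entirely
img = img.to("cuda")                           # move the decoded tensor to the GPU afterwards
```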
Also seeing this issue on CUDA 11.6 (running in a Docker container).
I just checked whether this was fixed in PyTorch nightly with CUDA 11.6, but I'm still experiencing the memory leak.
Same ^
Yes, there are still leaks, even on CUDA 11.6.
Memory leaks on torchvision-0.14.0+cu117 (torchvision-0.14.0%2Bcu117-cp37-cp37m-win_amd64.whl). Easy to reproduce:

```python
import torch
import torchvision

# jpeg_bytes: raw bytes of any JPEG file
for i in range(10000):
    torchvision.io.decode_jpeg(torch.frombuffer(jpeg_bytes, dtype=torch.uint8), device='cuda')
```

The memory leak didn't happen when using pynvjpeg 0.0.13, which seems to be built with CUDA 10.2:

```python
from nvjpeg import NvJpeg  # pynvjpeg

nj = NvJpeg()
nj.decode(jpeg_bytes)
```
Has anyone solved this problem? I also tried pynvjpeg; it is slower than torchvision.io.decode_jpeg, and eventually an error pops up like this: what(): memory allocator error, then aborted (core dumped).
It seems that this problem has been solved; my environment is as follows: […] Finally, after waiting for over a year. :)
🐛 Describe the bug
nvJPEG leaks memory and fails with OOM after ~1-2k images.
Probably related: the first response to #3848 is exactly the message you get after the OOM.
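For illustration, a minimal loop of the kind that triggers it (a sketch only; "image.jpg" is a placeholder path, and GPU memory in nvidia-smi climbs steadily until decoding fails with OOM):

```python
import torchvision

data = torchvision.io.read_file("image.jpg")  # placeholder path; any JPEG reproduces it

for i in range(2000):
    # On the affected CUDA/nvjpeg versions each call leaks device memory.
    img = torchvision.io.decode_jpeg(data, device="cuda")
```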
Versions
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Arch Linux (x86_64)
GCC version: (GCC) 11.1.0
Clang version: 12.0.1
CMake version: version 3.21.1
Libc version: glibc-2.33
Python version: 3.8.7 (default, Jan 19 2021, 18:48:37) [GCC 10.2.0] (64-bit runtime)
Python platform: Linux-5.13.8-arch1-1-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.4.48
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce GTX 1080
Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.2.2
/usr/lib/libcudnn_adv_infer.so.8.2.2
/usr/lib/libcudnn_adv_train.so.8.2.2
/usr/lib/libcudnn_cnn_infer.so.8.2.2
/usr/lib/libcudnn_cnn_train.so.8.2.2
/usr/lib/libcudnn_ops_infer.so.8.2.2
/usr/lib/libcudnn_ops_train.so.8.2.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] adabelief-pytorch==0.2.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] pytorch-lightning==1.4.5
[pip3] torch==1.9.0+cu111
[pip3] torchaudio==0.9.0
[pip3] torchfile==0.1.0
[pip3] torchmetrics==0.4.1
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect