Memory leak with CUDA-aware OpenMPI without UCX #12971

Open
kvoronin opened this issue Dec 9, 2024 · 2 comments

Comments

@kvoronin

kvoronin commented Dec 9, 2024

Hello!

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

4.1.7a1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Taken from NVIDIA's HPC SDK (more details in the logs)

Please describe the system on which you are running

  • Operating system/version:
    Linux, Ubuntu 22.04
  • Computer hardware:
    2 A100_40GB_PCIE
  • Network type:
    Not sure (any tips on how to extract this information?)

Details of the problem

A simple reproducer calls cudaMalloc, MPI_Bcast, and cudaFree in a loop and checks the free device memory via cudaMemGetInfo() on each iteration. The expectation is that the same amount of device memory is available at every iteration. In reality, the amount of free memory decreases, and if the test runs long enough, cudaMalloc eventually fails because the device runs out of memory (hence I call it a memory leak).
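For readers without the attachment, the loop described above might look roughly like the following. This is a minimal sketch, not the attached repro_test.txt: the element type (double), the iteration count, the rank-to-device mapping, and the helper names `CUDA_CHECK` and `lost_bytes` are all assumptions made for illustration.

```cuda
// Sketch only: an assumed shape of the cudaMalloc / MPI_Bcast / cudaFree loop,
// NOT the reproducer attached to the issue.
#include <cstdio>
#include <cstdlib>
#include <mpi.h>
#include <cuda_runtime.h>

#ifndef COUNT
#define COUNT 20971520  // same value as -DCOUNT on the nvcc command line
#endif

// Hypothetical helper: abort the whole MPI job on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            MPI_Abort(MPI_COMM_WORLD, 1);                             \
        }                                                             \
    } while (0)

// Hypothetical helper: bytes that disappeared between two readings (0 if none).
static size_t lost_bytes(size_t free_before, size_t free_after) {
    return free_before > free_after ? free_before - free_after : 0;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Assumption: one rank per GPU, assigned round-robin.
    int ndev = 0;
    CUDA_CHECK(cudaGetDeviceCount(&ndev));
    CUDA_CHECK(cudaSetDevice(rank % ndev));

    size_t free_start = 0, total = 0;
    CUDA_CHECK(cudaMemGetInfo(&free_start, &total));

    const int iterations = 100;  // assumed; the issue does not state a count
    for (int i = 0; i < iterations; ++i) {
        double* buf = nullptr;
        // A fresh device buffer is allocated on every iteration.
        CUDA_CHECK(cudaMalloc((void**)&buf, COUNT * sizeof(double)));
        CUDA_CHECK(cudaMemset(buf, 0, COUNT * sizeof(double)));

        // CUDA-aware MPI: the device pointer is passed directly.
        MPI_Bcast(buf, COUNT, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        CUDA_CHECK(cudaFree(buf));

        size_t free_now = 0;
        CUDA_CHECK(cudaMemGetInfo(&free_now, &total));
        printf("rank %d iter %d: free = %zu bytes, lost since start = %zu bytes\n",
               rank, i, free_now, lost_bytes(free_start, free_now));
    }

    MPI_Finalize();
    return 0;
}
```

If the behavior described in this issue is present, the "lost since start" figure would be expected to grow across iterations instead of staying near zero.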

Reproducer is compiled and run with

nvcc -O3 -DUSE_MPI -DUSE_OPENMPI -DCOUNT=20971520 repro_test.cu -o repro_test.out -lmpi

    mpirun --allow-run-as-root --mca pml ^ucx --mca osc ^ucx --mca coll ^hcoll,ucc -mca btl ^uct \
        --mca mpi_show_mca_params enviro --mca mpi_show_mca_params_file mca_params_myfile_enviro_dbg9.txt \
        --mca orte_base_help_aggregate 0 \
        --mca btl_base_verbose 100 --mca mtl_base_verbose 100 \
        -np 2 repro_test.out

Note: the same reproducer also fails with UCX (which is explicitly turned off in the command above), but there I know UCX-specific workarounds, and the issue is likely the same as #12849. For the non-UCX case, I am not sure whether that issue is relevant at all.

Reproducer as *.txt:
repro_test.txt

Output example: (also contains output from ompi_info --parsable --config and
log_non_ucx2.log

Any suggestions?

Thanks,
Kirill

@bosilca bosilca self-assigned this Dec 9, 2024
@kvoronin
Author

kvoronin commented Dec 9, 2024

Update: I've checked that adding --mca btl_smcuda_use_cuda_ipc 0 fixes the issue. As I understand, this is a workaround rather than a solution.

So I think it would be helpful if someone could comment on when and how this will get fixed (hopefully). I am a bit surprised that such a simple-looking (at least to me) reproducer does not work. Maybe it should be added to a test suite or something similar.

@tdavidcl

It does indeed seem related to #12849. Basically, a pointer used for communication between two GPUs gets registered for IPC, and the IPC handle is never released, which prevents the memory from being freed and results in a leak. So disabling IPC does indeed fix the issue.

In particular, the issue never occurs if you always use the same buffer for communication. However, if you keep changing it (as both you and I did), the number of IPC handles increases until you run out of memory.
Additionally, as far as I understand, this mechanism is indeed independent of UCX.
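The buffer-reuse observation above can be illustrated with a variant that allocates the device buffer once, outside the loop. This is a sketch based only on the comment's description and has not been verified against any Open MPI build; the element type, element count, iteration count, and the helper names `CUDA_CHECK` and `lost_bytes` are assumptions.

```cuda
// Sketch only: reuse one device buffer for every MPI_Bcast, so the same
// pointer is used for communication on each iteration.
#include <cstdio>
#include <cstdlib>
#include <mpi.h>
#include <cuda_runtime.h>

#ifndef COUNT
#define COUNT 20971520  // assumed element count, matching the original command
#endif

// Hypothetical helper: abort the whole MPI job on any CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            MPI_Abort(MPI_COMM_WORLD, 1);                             \
        }                                                             \
    } while (0)

// Hypothetical helper: bytes that disappeared between two readings (0 if none).
static size_t lost_bytes(size_t free_before, size_t free_after) {
    return free_before > free_after ? free_before - free_after : 0;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Assumption: one rank per GPU, assigned round-robin.
    int ndev = 0;
    CUDA_CHECK(cudaGetDeviceCount(&ndev));
    CUDA_CHECK(cudaSetDevice(rank % ndev));

    // Allocate once, before the loop.
    double* buf = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&buf, COUNT * sizeof(double)));
    CUDA_CHECK(cudaMemset(buf, 0, COUNT * sizeof(double)));

    // Baseline reading taken after the one-time allocation.
    size_t free_start = 0, total = 0;
    CUDA_CHECK(cudaMemGetInfo(&free_start, &total));

    const int iterations = 100;  // assumed
    for (int i = 0; i < iterations; ++i) {
        // Same device pointer every time.
        MPI_Bcast(buf, COUNT, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }

    size_t free_end = 0;
    CUDA_CHECK(cudaMemGetInfo(&free_end, &total));
    printf("rank %d: lost during loop = %zu bytes\n",
           rank, lost_bytes(free_start, free_end));

    // Free once, after the loop.
    CUDA_CHECK(cudaFree(buf));
    MPI_Finalize();
    return 0;
}
```

Reusing a buffer like this is a design workaround on the application side, not a fix for the unreleased IPC handles described in the comment.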

Note also that a very similar issue occurs with MPICH in my case.
