NCCL tests don't work on WSL #442
Comments
I have exactly the same problem... |
Thanks for these reports. Currently NCCL is not supported on WSL2 installations but we are working on validating it. |
I think this is also why I cannot use multi-GPU training with PyTorch: when I use PyTorch DataParallel, it gives me a similar NCCL error. |
I also ran into the issue of NCCL simply not supporting WSL environments. It would have helped to have the lack of support documented in the known limitations of the WSL user guide (https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations). This thread might be the only place on the net where a dev has said anything on the topic. |
I may have the same error. I'm trying to use multi-GPU across two nodes, one of which is a WSL2 environment, but the NCCL communicator seems to hang, displaying "cupy.cuda.nccl.NcclError: NCCL_ERROR_SYSTEM_ERROR: unhandled system error" only on the WSL2 side. Looking forward to the fix. |
Any update on this issue? NCCL support for WSL2 is needed so that I can use Transfer Learning Toolkit 3 on my Windows desktop using WSL2. |
NCCL 2.10.3 was released last week and it should support WSL2 with a single GPU. Multi-GPU has not been validated yet. |
It still doesn't work with the latest upgrades to TAO on WSL2 with the newest driver, 510.06. The output follows:
|
From your log:
Note, NCCL might have been compiled statically with tensorflow, so upgrading NCCL might not be enough to use the newest version. |
The current status should be that NCCL isn't supported (on multiple GPUs) for WSL. |
Same issue here with WSL2 (Windows 11), driver 510.06 and torch 1.9.1.cu111 with 2x 2080 Super. |
NCCL 2.11.4 has been tested on multi-GPU Win11 systems. I don't know what drivers and OS level are required though. You need to make sure that your pytorch/tensorflow subsystem hasn't been statically linked against an older NCCL version. |
@AddyLaddy Thanks for getting back to me. I checked and Torch 1.9.1.cu111 apparently uses NCCL 2.7.8. Will have to see what our options are now. |
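For reference, the NCCL version that a PyTorch build reports can be queried from Python. The sketch below assumes that `torch.cuda.nccl.version()` is available in the installed build; depending on the PyTorch release it returns either an integer code (such as 2708) or a tuple, so the helper normalizes both. The names `format_nccl_version` and `report_pytorch_nccl` are hypothetical and not part of PyTorch or NCCL.

```python
def format_nccl_version(version):
    """Return a dotted 'major.minor.patch' string for an NCCL version.

    Accepts a tuple such as (2, 10, 3) or an integer code as produced by
    the NCCL_VERSION macro in nccl.h.
    """
    if isinstance(version, tuple):
        return ".".join(str(part) for part in version[:3])
    # NCCL up to 2.8 encodes the version as major*1000 + minor*100 + patch;
    # NCCL 2.9 and later use major*10000 + minor*100 + patch.
    base = 1000 if version < 10000 else 10000
    major, rest = divmod(version, base)
    minor, patch = divmod(rest, 100)
    return f"{major}.{minor}.{patch}"


def report_pytorch_nccl():
    """Describe the NCCL version PyTorch reports, or why it could not be read."""
    try:
        import torch.cuda.nccl as torch_nccl
    except ImportError:
        return "PyTorch is not installed"
    try:
        return "PyTorch reports NCCL " + format_nccl_version(torch_nccl.version())
    except Exception as exc:  # e.g. a CPU-only build without NCCL
        return f"Could not query NCCL from PyTorch: {exc}"


if __name__ == "__main__":
    print(report_pytorch_nccl())
```

If the reported version is older than the NCCL installed on the system, PyTorch is most likely still using the copy it was built with.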
@AddyLaddy How can I unlink the old NCCL from PyTorch and update PyTorch's NCCL to version 2.11.4? I have installed version 2.11.4 in WSL2, and the nccl-tests pass. However, when training the model, PyTorch 1.7.1 still calls NCCL 2.7.8. |
I'm not a PyTorch expert, but I believe you need to configure and rebuild it using the USE_SYSTEM_NCCL=1 option. Perhaps ask in a PyTorch forum for help? |
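After such a rebuild, one rough way to check whether PyTorch picks up a shared NCCL library is to look at the shared objects mapped into the Python process once `torch` has been imported. This is only a sketch under stated assumptions: it relies on the Linux `/proc/self/maps` file (which should also be present under WSL2), the function names are hypothetical, and an empty result only suggests, rather than proves, that NCCL was linked statically.

```python
def nccl_libraries_in_maps(maps_text):
    """Return sorted unique paths of mapped files whose name contains 'libnccl'."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, and an optional pathname.
        fields = line.split(None, 5)
        if len(fields) == 6 and "libnccl" in fields[5].rsplit("/", 1)[-1]:
            paths.add(fields[5])
    return sorted(paths)


def nccl_libraries_loaded_by_torch():
    """Import torch, then list the libnccl shared objects mapped into this process."""
    import torch  # noqa: F401  (imported only so that its libraries get loaded)

    with open("/proc/self/maps") as maps_file:
        return nccl_libraries_in_maps(maps_file.read())
```

Calling `nccl_libraries_loaded_by_torch()` in the training environment lists the path of any shared `libnccl` that was loaded, and the file name usually carries the library version.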
@AddyLaddy Thank you very much. I'll try to recompile PyTorch. |
Hi. I ran into the same issue recently. Did recompiling PyTorch work? |
I've installed NCCL and its tests on WSL. When trying to run a test like this:
I get the following error message:
The debug log shows this:
Version of NCCL: version 2.8.3
Version of CUDA: 11.1
Windows: 10.0.20277
WSL: Ubuntu 20.04