Does NCCL not support virtual machines? #575
Comments
I'm not familiar with bitfusion, but it seems to share a GPU between multiple VM instances. This is very likely incompatible with NCCL, given that each NCCL rank needs to run on a different physical GPU.
Thank you for your reply. I got the information from this link (https://docs.vmware.com/en/VMware-vSphere-Bitfusion/2.5/rn/vmware-vsphere-bitfusion-compatibility-interop.html). VMware states that bitfusion supports NCCL versions 2.3, 2.4, 2.5, and 2.8 and later. My experiment uses NCCL 2.7.8, which bitfusion does not support. I cannot change the NCCL version even if I reinstall the NCCL library.
From that page:
I'm not sure what that means, but it sounds like what I described before. Regarding the NCCL version: many framework builds come with an NCCL version baked in, so you can't replace NCCL with a different version without changing or rebuilding the framework.
OK! Thank you for your reply!
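For anyone who wants to check which NCCL version their PyTorch build carries, here is a rough sketch. It assumes PyTorch is installed with NCCL support and uses `torch.cuda.nccl.version()`; the helper names (`decode_nccl_code`, `pytorch_nccl_version`, `is_bitfusion_supported`) are made up for illustration, and the supported-version list is only what was quoted from the VMware page in this thread, not something verified independently.

```python
def decode_nccl_code(code):
    """Decode an integer NCCL version code into (major, minor, patch).

    NCCL encodes versions up to 2.8 as major*1000 + minor*100 + patch
    (2.7.8 -> 2708) and later versions as major*10000 + minor*100 + patch.
    """
    if code < 10000:
        return (code // 1000, (code % 1000) // 100, code % 100)
    return (code // 10000, (code % 10000) // 100, code % 100)


def pytorch_nccl_version():
    """Return the NCCL version baked into PyTorch, or None if unavailable."""
    try:
        import torch
        v = torch.cuda.nccl.version()
    except Exception:  # torch not installed, or built without NCCL
        return None
    # Older PyTorch releases return an integer code, newer ones a tuple.
    if isinstance(v, int):
        return decode_nccl_code(v)
    return tuple(v[:3])


def is_bitfusion_supported(version):
    """Check a version tuple against the list quoted in this thread
    (2.3, 2.4, 2.5, and 2.8 or later). Hypothetical helper."""
    major_minor = (version[0], version[1])
    return major_minor in {(2, 3), (2, 4), (2, 5)} or major_minor >= (2, 8)


if __name__ == "__main__":
    version = pytorch_nccl_version()
    if version is None:
        print("Could not query NCCL version from PyTorch")
    else:
        print("NCCL", ".".join(map(str, version)),
              "in quoted supported list:", is_bitfusion_supported(version))
```

If the reported version is not the one you installed system-wide, that is consistent with the framework using its own bundled NCCL.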
Recently, I got a VM with 2 A100 GPUs. My team used the new DELL XE8545 server (https://infohub.delltechnologies.com/p/accelerating-hpc-workloads-with-nvidia-a100-nvlink-on-dell-poweredge-xe8545/) and bitfusion (https://docs.vmware.com/en/VMware-vSphere-Bitfusion/index.html) to create the virtual machine. I want to use this VM to run data-parallel training with PyTorch. However, I have run into several problems with the environment. The same workload runs successfully on my lab's server (2 TITAN GPUs) without bitfusion. Does NCCL not support such virtual machines? I know that NCCL does not support WSL (#442 (comment)). I am looking forward to your reply.