Does NCCL not support virtual machines? #575

Closed
ljz756245026 opened this issue Sep 27, 2021 · 4 comments

@ljz756245026

Recently I got a VM with 2 A100 GPUs. My team uses the new Dell XE8545 server (https://infohub.delltechnologies.com/p/accelerating-hpc-workloads-with-nvidia-a100-nvlink-on-dell-poweredge-xe8545/) and Bitfusion (https://docs.vmware.com/en/VMware-vSphere-Bitfusion/index.html) to create the virtual machine. I want to use this VM to run data-parallel training with PyTorch, but I have run into several problems with the environment. The same setup works on my lab's server (2 TITAN GPUs) without Bitfusion. Does NCCL not support such virtual machines? I know that NCCL does not support WSL (#442 (comment)).

I am looking forward to your reply.

@sjeaugey
Member

I'm not familiar with bitfusion, but it seems to be sharing a GPU between multiple VM instances. This is very likely incompatible with NCCL, given each NCCL rank needs to run on a different physical GPU.
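As a rough illustration of the "one rank per physical GPU" constraint, the sketch below flags ranks that would land on the same device. It is not part of NCCL or PyTorch; the function name `ranks_sharing_a_gpu` and the GPU identifiers are hypothetical, and how you obtain a stable per-GPU identifier inside a Bitfusion VM is left as an assumption.

```python
from collections import defaultdict


def ranks_sharing_a_gpu(rank_to_gpu):
    """Return {gpu_id: [ranks]} for every GPU that is assigned to more than one rank.

    rank_to_gpu is a hypothetical mapping from rank number to some stable
    identifier of the physical GPU (for example a UUID or PCI bus ID string).
    """
    by_gpu = defaultdict(list)
    for rank, gpu in sorted(rank_to_gpu.items()):
        by_gpu[gpu].append(rank)
    # Keep only the GPUs that more than one rank maps to.
    return {gpu: ranks for gpu, ranks in by_gpu.items() if len(ranks) > 1}


if __name__ == "__main__":
    # Made-up identifiers: ranks 0 and 1 end up on the same physical GPU.
    print(ranks_sharing_a_gpu({0: "GPU-a", 1: "GPU-a", 2: "GPU-b"}))
```

An empty result means every rank has its own GPU, which is the layout described above.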

@ljz756245026
Author

> I'm not familiar with bitfusion, but it seems to be sharing a GPU between multiple VM instances. This is very likely incompatible with NCCL, given each NCCL rank needs to run on a different physical GPU.

Thank you for your reply. I got the information from this link (https://docs.vmware.com/en/VMware-vSphere-Bitfusion/2.5/rn/vmware-vsphere-bitfusion-compatibility-interop.html). VMware states that Bitfusion supports NCCL versions 2.3, 2.4, 2.5, and 2.8 and later. My experiment uses NCCL 2.7.8, which Bitfusion does not support. However, I cannot change the NCCL version, even after reinstalling the NCCL library.
Do you have any suggestions on how to install a different NCCL version? I tried reinstalling and rebooted the machine, but the NCCL version did not change.

@sjeaugey
Member

sjeaugey commented Sep 28, 2021

From that page:

> Using NCCL with multi-process applications that run on different vSphere Bitfusion clients is not supported.

I'm not sure what that means but it sounds like what I meant before.

Now, regarding the NCCL version, many framework builds come with an NCCL version baked in, so you can't replace NCCL with a different version without changing or rebuilding the framework.
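To see which NCCL version a PyTorch build actually carries, something like the following sketch may help. It assumes PyTorch is the framework in use and relies on `torch.cuda.nccl.version()`, which returns a tuple in recent releases and an integer code in older ones. The helpers `decode_nccl_version` and `listed_as_supported` are hypothetical names, and the supported list (2.3, 2.4, 2.5, 2.8 and later) is only the summary quoted earlier in this thread, not something checked against the VMware page.

```python
def decode_nccl_version(raw):
    """Normalize an NCCL version to a (major, minor, patch) tuple.

    Accepts either a tuple or the integer NCCL_VERSION_CODE, which is
    major*1000 + minor*100 + patch up to 2.8 and major*10000 + minor*100 + patch
    from 2.9 onward.
    """
    if isinstance(raw, tuple):
        return tuple(raw[:3])
    major, rest = divmod(raw, 1000 if raw < 10000 else 10000)
    minor, patch = divmod(rest, 100)
    return (major, minor, patch)


def listed_as_supported(version):
    """Check a version against the list quoted in this thread (an assumption)."""
    major_minor = tuple(version[:2])
    return major_minor in {(2, 3), (2, 4), (2, 5)} or major_minor >= (2, 8)


def framework_nccl_version():
    """Return the NCCL version bundled with PyTorch, or None if unavailable."""
    try:
        from torch.cuda import nccl
        return decode_nccl_version(nccl.version())
    except Exception:
        # PyTorch missing, or a build without NCCL support.
        return None


if __name__ == "__main__":
    version = framework_nccl_version()
    if version is None:
        print("No NCCL-enabled PyTorch build found")
    else:
        print("PyTorch NCCL version:", version,
              "| on the quoted list:", listed_as_supported(version))
```

If the reported version stays the same after reinstalling the system NCCL package, that is consistent with the framework using its own bundled copy.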

@ljz756245026
Author

> From that page:
>
> > Using NCCL with multi-process applications that run on different vSphere Bitfusion clients is not supported.
>
> I'm not sure what that means but it sounds like what I meant before.
>
> Now, regarding the NCCL version, many framework builds come with an NCCL version baked in, so you can't replace NCCL with a different version without changing or rebuilding the framework.

OK! Thank you for your reply!
I understand now that the problem is caused by the Bitfusion software and is not an NCCL bug. Thank you for your patience!
