Driver Version: 535.129.03
MscclppAllReduce3
Hi MSCCL++ team,
Do you know if driver version 535.129.03 has a bug that causes AllReduce3 to time out?
Thanks, --Saeed
Hmm... we have not tested with this version. The Azure HPC image uses driver 535.86.10 and doesn't have this issue: https://github.com/Azure/azhpc-images/blob/63e5eaa23de69ccc1c6e6a52dff29037c88e96d4/ubuntu/common/install_nvidiagpudriver.sh#L16-L19
Thanks @Binyang2014! I'm debugging this issue with NVIDIA.
Hi @saeedmaleki, is this issue resolved on your end? 535.154.05 is working well in my environment.
It definitely still happens. I think this is a non-deterministic bug, and NVIDIA couldn't reproduce it either, so maybe we could ignore it for now.
Actually, I can occasionally reproduce this bug. @Binyang2014 @aashaka please be aware.
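For anyone comparing their own setup against the driver versions mentioned in this thread, here is a minimal Python sketch. It only restates what was reported above and is not a compatibility list. The helper names (`parse_version`, `installed_driver_versions`, `status`) are made up for illustration, and the only external call is the standard `nvidia-smi --query-gpu=driver_version --format=csv,noheader` query.

```python
import subprocess

# Driver versions mentioned in this thread only (an assumption-free restatement,
# not an exhaustive or authoritative compatibility list).
REPORTED = {
    "535.129.03": "AllReduce3 timeout reported (intermittent)",
    "535.86.10": "reported working (Azure HPC image)",
    "535.154.05": "reported working in one environment",
}


def parse_version(text):
    """Turn a string such as '535.129.03' into (535, 129, 3) for comparisons."""
    return tuple(int(part) for part in text.strip().split("."))


def installed_driver_versions():
    """Ask nvidia-smi for the driver version of each visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def status(version):
    """Look up what this thread says about a driver version, if anything."""
    return REPORTED.get(version.strip(), "not discussed in this thread")
```

On a machine with the NVIDIA driver installed, `for v in installed_driver_versions(): print(v, "->", status(v))` prints one line per GPU. Since the timeout appears to be non-deterministic, a "reported working" entry does not rule the bug out.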