[Bug] Is there a known bug with Driver Version: 535.129.03 which causes MscclppAllReduce3 to hang? #260

Open
saeedmaleki opened this issue Feb 6, 2024 · 5 comments

Comments

@saeedmaleki
Contributor

Hi MSCCL++ team,

Do you know if Driver Version: 535.129.03 has a bug that causes AllReduce3 to time out?

Thanks,
--Saeed

@Binyang2014
Contributor

Hmm... we have not tested with this version. The Azure HPC image uses driver 535.86.10 and doesn't have this issue.
https://github.com/Azure/azhpc-images/blob/63e5eaa23de69ccc1c6e6a52dff29037c88e96d4/ubuntu/common/install_nvidiagpudriver.sh#L16-L19

@saeedmaleki
Contributor Author

Thanks @Binyang2014! We are debugging this issue with NVIDIA.

@chhwang
Contributor

chhwang commented Mar 26, 2024

Hi @saeedmaleki, is this issue resolved on your end? 535.154.05 is working well in my environment.

@saeedmaleki
Contributor Author

It definitely still happens. I think this is a non-deterministic bug. NVIDIA couldn't reproduce it either, so maybe we could ignore it for now.

@chhwang
Contributor

chhwang commented Apr 6, 2024

Actually, I can occasionally reproduce this bug. @Binyang2014 @aashaka please be aware.
