[BUG] The code for deepspeed.comm.comm.monitored_barrier() #4488
Comments
@phalexo thanks for bringing this to our attention. @Quentin-Anthony, I believe you were the last one to touch this line. Can you comment? Thanks!
Thank you for looking at this. The initial problem for me is that there does not seem to be a way to change the timeout. I went as far as changing the hardcoded value of "default_pg_timeout" to 5 hours, but for some reason an exception is raised at 30 minutes anyway. There must be some way to control the timeout for the default group.
Fixing this bug may or may not allow changing the timeout. See below.

FILE: ~/pytorch/torch/csrc/distributed/c10d/Backend.hpp

```cpp
#pragma once

#include <condition_variable>

#include <ATen/ATen.h>
#include <torch/csrc/distributed/c10d/Work.hpp>

constexpr auto kBackendDefaultTimeout = ...

namespace c10d {

class TORCH_API Backend : public torch::CustomClassHolder {
  // Backend Options is a base struct that defines the basic options
  ...
```
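For what it's worth, the compiled-in kBackendDefaultTimeout above is only a fallback: PyTorch accepts a per-process-group timeout at init time, so rebuilding from source should not normally be needed. A minimal sketch, assuming a gloo group and the usual env:// rendezvous variables:

```python
from datetime import timedelta
import torch.distributed as dist

# The hardcoded kBackendDefaultTimeout is only the fallback; the timeout for a
# process group can be set explicitly when the group is created.
dist.init_process_group(backend="gloo", timeout=timedelta(hours=5))
```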
@mrwyattii -- You're correct! Looks like a typo. I introduced a PR in #4496. @phalexo -- I believe the cause of your issue is that …
I am not sure how you managed to test it with the 'nccl' backend. This is what I got:

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:20<00:00, 10.06s/it]

There is this bit of code in the distributed_c10d.py file.
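In essence, that code refuses any backend other than gloo. A caller-side guard with the same effect might look like this (a hedged sketch with a hypothetical helper name, not the torch source):

```python
import torch.distributed as dist

def checked_monitored_barrier(timeout):
    # torch.distributed only implements monitored_barrier for the gloo backend;
    # calling it on an nccl process group raises an error, so guard it here.
    if dist.get_backend() != "gloo":
        raise RuntimeError("monitored_barrier requires the gloo backend")
    dist.monitored_barrier(timeout=timeout)
```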
It seems only the deepspeed/comm/torch.py has been modified, but deepspeed/comm/comm.py remains the same. I will test with both and see if it makes a difference.

So, torch.distributed.monitored_barrier() is NOT implemented for 'nccl', just for 'gloo'.

deepspeed.comm.comm.py has not been modified, so it gives the same error:

Traceback (most recent call last):

And then there is the Backend.hpp snippet quoted above (FILE: ~/pytorch/torch/csrc/distributed/c10d/Backend.hpp), with its hardcoded kBackendDefaultTimeout.
The exception happens deep inside the C/C++ code in pytorch/nccl, @Quentin-Anthony. Somehow one has to be able to adjust the timeout inside the backend.

frame #29: PyEval_EvalCodeEx + 0x39 (0x5568d06e5fd9 in /home/developer/mambaforge/envs/FinGPT/bin/python3.8)
As an experiment I replaced the "barrier" calls with "all_reduce", since from a timing point of view it performs essentially the same synchronization as a barrier. I got exactly the same error with the socket timeout, except the function name in the trace was now "all_reduce" instead of "barrier". Could you post your code for experimenting with 5- and 45-minute timeouts?
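A sketch of that substitution, assuming a CPU tensor on a gloo group (with nccl the tensor would need to live on the rank's GPU):

```python
import torch
import torch.distributed as dist

# A tiny all_reduce used as a stand-in for barrier(): no rank's call returns
# until every rank has joined, so it synchronizes like a barrier does -- and it
# hits the same collective timeout when one rank never shows up.
sync = torch.zeros(1)
dist.all_reduce(sync)
```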
I don't think this is correct? Here's the PR change: https://github.com/microsoft/DeepSpeed/pull/4496/files#diff-2bf8bf2cd0ea4637cdd3397e6e27d38ce59e0ebc8daafcc583dbfcd8c092b0beR419
Yep, you're right -- I was using an older custom version of torch. What I ran was essentially the following:
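A reconstruction of that kind of test, based on the description later in this thread rather than the original snippet:

```python
from datetime import timedelta
import torch.distributed as dist

# Reconstructed sketch, not the exact code that was run: every rank except
# rank 1 enters the monitored barrier with a short, explicit timeout, so
# rank 0 reports rank 1 as missing once that timeout expires instead of
# waiting out the default 30 minutes.
dist.init_process_group(backend="gloo")

if dist.get_rank() != 1:
    dist.monitored_barrier(timeout=timedelta(minutes=5))
```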
Can you try this? Also, I think your socket timeouts are completely unrelated to the barrier timeout.
@Quentin-Anthony That said, what I think happens is this (see the sketch below): ranks above 0 reach a barrier() under "if rank > 0:" and stop there, waiting for rank 0 to finish preprocessing and rejoin at a second barrier() under "if rank == 0:". When about 30 minutes hit, they assume rank 0 is dead and some kind of exception processing happens. I have looked at some settings for sockets and the timeout is 10 minutes, but there are 3 retries for sockets, so maybe the 30 minutes come from that.

Why isn't monitored_barrier implemented for nccl? I thought it is faster than gloo.

As for "if dist.get_rank() != 1:" -- should this be 1? Or -1, just a choice? Or 0?

p.s. If I change the backend to gloo and barrier() to monitored_barrier(timeout=timedelta(minutes=...)), the function should internally adjust the timeout for sockets too. What is the point of having a process wait at a barrier if a socket timeout causes an exception anyway? Sockets have to be set to "keep_alive" or given a longer timeout.

CONCLUSIONS: It is still a problem for GLOO, unless one is aware that one cannot use barrier() and has to use monitored_barrier() with a timeout. So, the socket timeout is related to the barrier timeout in the sense that a correctly working barrier with a longer timeout prevents ranks > 0 from reaching the socket timeout point. I don't understand, though, why changing the default timeout values in the torch source (Backend.hpp, ProcessGroup.hpp) and rebuilding PyTorch did not solve the problem.
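A sketch of that pattern (preprocess_dataset is a hypothetical placeholder, and per-rank device setup is omitted):

```python
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()

if rank > 0:
    # Ranks > 0 park here while rank 0 runs the long CPU-side preprocessing.
    # With the default process-group timeout this wait is capped at ~30 minutes.
    dist.barrier()

if rank == 0:
    preprocess_dataset()   # hypothetical placeholder for the preprocessing step
    dist.barrier()         # rank 0 joins the same barrier; all ranks continue together
```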
Because
Just a choice. My code is just triggering the case where a single rank (1 in this case) doesn't enter the barrier.
Wouldn't the 30-minute timeout just be from the default pytorch pg timeout? https://github.com/microsoft/DeepSpeed/blob/78c518ed979b4cf1df1816bdbe52b0d21531bee4/deepspeed/comm/comm.py#L608
default_pg_timeout itself is NOT set to an actual value, but instead is set to another variable. This is a recent change:

default_pg_timeout = timedelta(minutes=int(os.getenv("DEEPSPEED_TIMEOUT", default=30)))

Although this works to change the value, it does not appear to change the actual timeout. I don't know why. It is still a total puzzle to me why the values defined in Backend.hpp and ProcessGroup.hpp are not controlling all the other values, including default_pg_timeout. I built pytorch from source after modifying those timeout values by 10x, but it had no effect.

If nccl is strictly GPU-side, then why can't one have CPU-side functions for barriers? Why do I need to change the backend? I am not quite getting this. I am not trying to sync the GPU processes, just the CPU-side preprocessing.
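One possible explanation, offered as an assumption rather than something confirmed in this thread: that assignment runs once at import time, so DEEPSPEED_TIMEOUT has to be in the environment before deepspeed is imported, and the value can also be passed explicitly when the process group is initialized. A sketch:

```python
import os
from datetime import timedelta

# Assumption: DEEPSPEED_TIMEOUT is read when deepspeed's constants module is
# imported, so it must be set before the import (300 minutes here).
os.environ["DEEPSPEED_TIMEOUT"] = "300"

import deepspeed

# init_distributed also accepts an explicit timeout for the default group.
deepspeed.init_distributed(dist_backend="gloo", timeout=timedelta(minutes=300))
```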
Still a problem, even with the GLOO backend. The timeout for barriers is set to 10 hours, but they timed out at 6.6 hours apparently.

The code for deepspeed.comm.comm.monitored_barrier() takes a "timeout" parameter but then INTERNALLY calls a function *.barrier() that does NOT take the "timeout" parameter. So, it is useless.
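A self-contained sketch of the shape of the bug being reported, using stub names rather than the actual DeepSpeed source: the wrapper accepts a timeout but forwards the call to a plain barrier() that has no timeout to honor, so the obvious remedy is to forward the arguments to the backend's monitored_barrier instead.

```python
from datetime import timedelta

# Stub standing in for DeepSpeed's torch-backend wrapper ("cdb").
# Hypothetical names -- this illustrates the reported bug, it is not real source.
class TorchBackendStub:
    def barrier(self, group=None):
        print("barrier() called -- there is no timeout argument to honor here")

    def monitored_barrier(self, group=None, timeout=None, wait_all_ranks=False):
        print(f"monitored_barrier() called with timeout={timeout}")

cdb = TorchBackendStub()

def monitored_barrier_buggy(group=None, timeout=timedelta(minutes=30), wait_all_ranks=False):
    # Accepts a timeout but forwards to plain barrier(), silently dropping it.
    return cdb.barrier(group=group)

def monitored_barrier_fixed(group=None, timeout=timedelta(minutes=30), wait_all_ranks=False):
    # Forwards every argument to the backend's monitored_barrier instead.
    return cdb.monitored_barrier(group=group, timeout=timeout, wait_all_ranks=wait_all_ranks)

monitored_barrier_buggy(timeout=timedelta(hours=10))   # the 10-hour timeout is ignored
monitored_barrier_fixed(timeout=timedelta(hours=10))   # the 10-hour timeout reaches the backend
```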
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
ds_report output
Please run
ds_report
to give us details about your setup.

Screenshots
If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
Launcher context
Are you launching your experiment with the
deepspeed
launcher, MPI, or something else?

Docker context
Are you using a specific docker image that you can share?
Additional context
Add any other context about the problem here.