What happened + What you expected to happen

I have a machine with 2 GPUs. In my use case, I need to run multiple TorchTrainers concurrently on different datasets.
Ray Train is set up to avoid fragmenting jobs across different GPUs, but in this case that leads to an obvious NCCL error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 3b000
I would expect that when using Ray Train, the TorchTrainer internally knows to set CUDA_VISIBLE_DEVICES so that workers are "spread" across GPUs rather than "packed" onto a single GPU.
I can get "something" that looks like the correct behavior by launching the first train job with GPU: 0.6 per worker, so its workers spread across both GPUs, and then launching a second train job with GPU: 0.4.
However, if you run a job with anything < 0.5, both workers will try to use the same GPU, resulting in the NCCL error.
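For concreteness, this is roughly the knob I mean by GPU: 0.6 / GPU: 0.4, assuming it maps onto resources_per_worker in each trainer's ScalingConfig with 2 workers per trainer (a sketch, not my exact launch code):

```python
from ray.train import ScalingConfig

# 2 workers x 0.6 GPU = 1.2 GPUs total: this cannot fit on one GPU, so Ray is
# forced to place the two workers on different devices.
first_job = ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.6}
)

# 2 workers x 0.4 GPU = 0.8 GPUs: this *can* fit on a single GPU, so a job
# launched like this on an otherwise idle machine gets both workers packed onto
# the same device, and NCCL raises "Duplicate GPU detected". Launched second,
# after the 0.6 job above, only 0.4 is left free on each GPU, so it happens to
# spread correctly -- which is the fragile workaround described above.
second_job = ScalingConfig(
    num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.4}
)
```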
Obviously this is a terrible idea for multiple reasons; for example, the total CPU count and RAM count is 2x the correct amount. One can hack around that as well. One can also imagine a setting where the scheduler cannot SPREAD, leaving the system back in a failed state. STRICT_SPREAD is likely a smarter option, to ensure failure if the spread wouldn't take place.
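For reference, ScalingConfig does expose a placement_strategy argument, which is where a SPREAD or STRICT_SPREAD request would go. A minimal sketch of that idea follows; note that, per the Ray docs, these strategies spread placement-group bundles across nodes rather than across GPUs within one node, so on a single 2-GPU machine STRICT_SPREAD would mostly serve to fail loudly rather than to fix the GPU assignment:

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config: dict) -> None:
    ...  # placeholder training loop


# Sketch only: "SPREAD" is best-effort, while "STRICT_SPREAD" refuses to
# schedule unless the bundles land on distinct nodes, i.e. it fails fast
# instead of silently packing.
scaling = ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"GPU": 0.4},
    placement_strategy="STRICT_SPREAD",  # or "SPREAD"
)
trainer = TorchTrainer(train_loop_per_worker, scaling_config=scaling)
```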
I also tried playing aggressively with workarounds along those lines, but internally the Ray device_manager really hates this, and there doesn't seem to be a working combination that allows it to slide through.
Any thoughts?
Versions / Dependencies
Ray 2.37, Python 3.10
Reproduction script
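A minimal sketch along the lines of the setup described above; the GPU_FRACTION environment variable and the toy all_reduce are illustrative additions, there only to force NCCL communicator creation:

```python
"""Minimal sketch: run on a machine with 2 GPUs.

With GPU_FRACTION < 0.5 (e.g. the default 0.4), Ray packs both workers onto the
same GPU and NCCL fails with "Duplicate GPU detected". Launching one copy with
GPU_FRACTION=0.6 first, then another with 0.4, is the workaround described above.
"""
import os

import ray
import torch
import torch.distributed as dist
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

GPU_FRACTION = float(os.environ.get("GPU_FRACTION", "0.4"))  # illustrative knob


def train_loop_per_worker(config: dict) -> None:
    # Ray Train has already initialized the torch.distributed process group
    # (NCCL backend when use_gpu=True). A single all_reduce is enough to force
    # NCCL communicator creation, which is where "Duplicate GPU detected"
    # surfaces if both ranks were assigned the same physical GPU.
    device = ray.train.torch.get_device()
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"), "device:", device)
    t = torch.ones(1, device=device)
    dist.all_reduce(t)


if __name__ == "__main__":
    ray.init()  # or ray.init(address="auto") to share one cluster across jobs
    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(
            num_workers=2,
            use_gpu=True,
            resources_per_worker={"GPU": GPU_FRACTION},
        ),
    )
    trainer.fit()
```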
Issue Severity
High: It blocks me from completing my task.