
[Train] Using Ray Train with fractional GPUs leads to NCCL error - Duplicate GPU detected #48012

Open
choosehappy opened this issue Oct 14, 2024 · 4 comments
Labels
bug (Something that is supposed to be working; but isn't) · train (Ray Train Related Issue) · triage (Needs triage (eg: priority, bug/not-bug, and owning component))

Comments

@choosehappy

What happened + What you expected to happen

I have a machine with 2 GPUs. In my use case, I need to run multiple TorchTrainers concurrently on different datasets.

Ray Train is set up to avoid fragmenting jobs across different GPUs, but in this case that leads to an obvious NCCL error:

Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 3b000

I would expect that, when using Ray Train, the TorchTrainer would internally know to set CUDA_VISIBLE_DEVICES so that workers are "spread" across GPUs rather than "packed" onto a single GPU.

I can get something that looks like the correct behavior by launching the first train job with GPU: 0.6, so that its workers spread across both GPUs, and then launching a second train job with GPU: 0.4.

However, if you run a job with anything < 0.5, both workers will try to use the same GPU, resulting in the NCCL error.
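A minimal sketch of checking the assignment from inside the training function (just illustrative, using the public ray.get_gpu_ids() and ray.train.get_context() calls); with {"GPU": 0.1}, both workers report the same device:

import os

import ray
from ray import train


def print_gpu_assignment():
    # Call this from inside the training function to see which GPU each worker got.
    ctx = train.get_context()
    print(
        f"world_rank={ctx.get_world_rank()} "
        f"local_rank={ctx.get_local_rank()} "
        f"ray_gpu_ids={ray.get_gpu_ids()} "
        f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}"
    )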

I tried playing aggressively with something like:

cuda_dev = torch.device("cuda", train.get_context().get_local_rank())
model = ray.train.torch.prepare_model(model, cuda_dev)

but internally the Ray device_manager really hates this, and there doesn't seem to be a working combination that lets it slide through.

Any thoughts?

Versions / Dependencies

Ray 2.37, Python 3.10

Reproduction script

import ray
import ray.train.torch
from ray.train import ScalingConfig
from torchvision.models import resnet18

ray.init()


def trainpred_func(config):
    model = resnet18(num_classes=10)
    # prepare_model wraps the model in DDP; the "Duplicate GPU detected" NCCL error surfaces here
    model = ray.train.torch.prepare_model(model)


scaling_config = ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1})
trainer = ray.train.torch.TorchTrainer(trainpred_func, scaling_config=scaling_config)
trainer.fit()

Issue Severity

High: It blocks me from completing my task.

@choosehappy choosehappy added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 14, 2024
@anyscalesam anyscalesam added the train Ray Train Related Issue label Oct 16, 2024
@dhirajtobii

Does Ray support a fractional scaling config for GPU workers with the default NCCL backend? I can see that the above code works fine with the gloo backend.

I am also interested in knowing the status of this feature.
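
Roughly what I mean by switching to gloo; a minimal sketch via TorchConfig, assuming the same trainpred_func and scaling config as in the report above:

from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

trainer = TorchTrainer(
    trainpred_func,
    # use gloo instead of the default nccl process-group backend
    torch_config=TorchConfig(backend="gloo"),
    scaling_config=ScalingConfig(
        num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1}
    ),
)
trainer.fit()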

@choosehappy
Author

Coming back to this, I found a janky workaround:

  1. Create one Docker container per GPU on the node.

     Start Ray within the first container:

     CUDA_VISIBLE_DEVICES=0 ray start --address='172.20.0.2:6379'

     and then within the second container:

     CUDA_VISIBLE_DEVICES=1 ray start --address='172.20.0.2:6379'

     etc. etc.

  2. This essentially creates multiple "nodes", each with 1 GPU.

  3. One can then request a placement_strategy of SPREAD, which avoids the packing behavior described above:

     scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU": 0.1}, placement_strategy="SPREAD")

Obviously this is a terrible idea for multiple reasons; for example, the total CPU count and RAM count are 2x the correct amount. One can hack around that as well. One can also imagine a setting where the scheduler cannot SPREAD, leaving the system back in a failed state. STRICT_SPREAD is likely a smarter option, ensuring failure if the spread wouldn't take place (see the sketch below).
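
A minimal sketch of that stricter variant (same scaling config as above, only the placement strategy differs):

scaling_config = ray.train.ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"GPU": 0.1},
    # require each worker bundle to land on a distinct "node" (i.e. a distinct container/GPU in this setup)
    placement_strategy="STRICT_SPREAD",
)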

but at least it "works" : )

@sud474

sud474 commented Dec 16, 2024

Hi, did you find a solution to this?

@choosehappy
Author

Only the janky one I mentioned above.
