[Train] Is RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES expected to work in Train? #49985

Open

choosehappy opened this issue Jan 21, 2025
What happened + What you expected to happen

Requesting a fractional GPU per worker appears to cause RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES to be ignored when using TorchTrainer, as demonstrated below.

I’m using Ray 2.40.0 (see the versions below), and this works as expected:

import os
import time

import ray
import ray.train
import ray.train.torch

def trainpred_func(config):
    # Each Train worker reports what it actually sees.
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")
    time.sleep(100)

scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True)
trainer = ray.train.torch.TorchTrainer(trainpred_func, scaling_config=scaling_config)
trainer.fit()

With output:

(RayTrainWorker pid=18626) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=18539) Started distributed worker processes: 
(TorchTrainer pid=18539) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18626) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=18539) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18625) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=18626) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18626) os.environ['CUDA_VISIBLE_DEVICES']='0,1'
(RayTrainWorker pid=18625) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18625) os.environ['CUDA_VISIBLE_DEVICES']='0,1'
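For reference, RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES has to be visible inside the worker processes. A minimal sketch of one way to get it there (an assumption about the setup, not the only option; exporting it in the shell before starting Ray should also work):

import ray

# Propagate the flag to all Ray workers via the job's runtime environment.
ray.init(
    runtime_env={
        "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1"}
    }
)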

However, adding a fractional GPU resource per worker, like this:

scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU":.1})

now produces this output:

(RayTrainWorker pid=18351) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=18260) Started distributed worker processes: 
(TorchTrainer pid=18260) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18351) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=18260) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18352) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=18352) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18352) os.environ['CUDA_VISIBLE_DEVICES']='0'
(RayTrainWorker pid=18351) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18351) os.environ['CUDA_VISIBLE_DEVICES']='0'

We're still trying to work around the lack of GPU spreading, as discussed in #48012. Self-management of the GPUs would be an easy, acceptable solution! A rough sketch of what that could look like follows.
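This is only a sketch, assuming the env var is honored so each worker still sees all GPUs; the local-rank-based assignment below is illustrative, not Ray Train's own placement logic:

import torch
import ray.train

def trainpred_func(config):
    # Pick a device ourselves from the worker's local rank instead of
    # relying on whatever CUDA_VISIBLE_DEVICES Ray sets.
    local_rank = ray.train.get_context().get_local_rank()
    device_id = local_rank % torch.cuda.device_count()
    torch.cuda.set_device(device_id)
    print(f"local_rank={local_rank} -> cuda:{device_id}")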

Versions / Dependencies

ray==2.40.0
Python 3.10.12
Docker container: nvcr.io/nvidia/pytorch:24.08-py3

Reproduction script

As provided above
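For convenience, a consolidated version of the snippet above; it assumes RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 is already exported before the script runs, and the use_fractional toggle switches between the working and broken configurations:

import os
import time

import ray
import ray.train
import ray.train.torch

def trainpred_func(config):
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")
    time.sleep(100)

use_fractional = True  # True reproduces the bug, False the expected behavior

scaling_config = ray.train.ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"GPU": 0.1} if use_fractional else None,
)
trainer = ray.train.torch.TorchTrainer(trainpred_func, scaling_config=scaling_config)
trainer.fit()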

Issue Severity

High: It blocks me from completing my task.
