[Train] Is RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES expected to work in Train? #49985

Open

choosehappy opened this issue Jan 21, 2025
What happened + What you expected to happen

Requesting a fractional GPU per worker appears to cause RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES to be ignored when using TorchTrainer, as demonstrated below.

I’m using Ray 2.40.0 (see the versions below), and this works as expected:

import os
import time

import ray
import ray.train
import ray.train.torch

def trainpred_func(config):
    # Each Train worker reports what it actually sees.
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")
    time.sleep(100)

scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True)
trainer = ray.train.torch.TorchTrainer(trainpred_func, scaling_config=scaling_config)
trainer.fit()

With output:

(RayTrainWorker pid=18626) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=18539) Started distributed worker processes: 
(TorchTrainer pid=18539) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18626) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=18539) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18625) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=18626) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18626) os.environ['CUDA_VISIBLE_DEVICES']='0,1'
(RayTrainWorker pid=18625) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18625) os.environ['CUDA_VISIBLE_DEVICES']='0,1'
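For reference, RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES has to be visible inside the worker processes. A minimal sketch of one way to get it there (an assumption about the setup, not the only option; exporting it in the shell before starting Ray should also work):

import ray

# Propagate the flag to all Ray workers via the job's runtime environment.
ray.init(
    runtime_env={
        "env_vars": {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1"}
    }
)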

However, adding a fractional GPU resource per worker, like this:

scaling_config = ray.train.ScalingConfig(num_workers=2, use_gpu=True, resources_per_worker={"GPU":.1})

now produces this output:

(RayTrainWorker pid=18351) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=18260) Started distributed worker processes: 
(TorchTrainer pid=18260) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18351) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=18260) - (node_id=5c09e3f8571d11ef40b89b18c70c6dfccadbc36f018d34c89d700902, ip=172.17.0.3, pid=18352) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=18352) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18352) os.environ['CUDA_VISIBLE_DEVICES']='0'
(RayTrainWorker pid=18351) os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']='1'
(RayTrainWorker pid=18351) os.environ['CUDA_VISIBLE_DEVICES']='0'

We're still trying to work around the lack of GPU spreading, as discussed in #48012. Self-management of the GPUs would be an easy, acceptable solution! A rough sketch of what that could look like follows.
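This is only a sketch, assuming the env var is honored so each worker still sees all GPUs; the local-rank-based assignment below is illustrative, not Ray Train's own placement logic:

import torch
import ray.train

def trainpred_func(config):
    # Pick a device ourselves from the worker's local rank instead of
    # relying on whatever CUDA_VISIBLE_DEVICES Ray sets.
    local_rank = ray.train.get_context().get_local_rank()
    device_id = local_rank % torch.cuda.device_count()
    torch.cuda.set_device(device_id)
    print(f"local_rank={local_rank} -> cuda:{device_id}")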

Versions / Dependencies

ray==2.40.0
Python 3.10.12
Docker container: nvcr.io/nvidia/pytorch:24.08-py3

Reproduction script

As provided above
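For convenience, a consolidated version of the snippet above; it assumes RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 is already exported before the script runs, and the use_fractional toggle switches between the working and broken configurations:

import os
import time

import ray
import ray.train
import ray.train.torch

def trainpred_func(config):
    print(f"{os.environ['RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES']=}")
    print(f"{os.environ['CUDA_VISIBLE_DEVICES']=}")
    time.sleep(100)

use_fractional = True  # True reproduces the bug, False the expected behavior

scaling_config = ray.train.ScalingConfig(
    num_workers=2,
    use_gpu=True,
    resources_per_worker={"GPU": 0.1} if use_fractional else None,
)
trainer = ray.train.torch.TorchTrainer(trainpred_func, scaling_config=scaling_config)
trainer.fit()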

Issue Severity

High: It blocks me from completing my task.
