
[Train] Add env vars to enable Share AMD ROCR_VISIBLE_DEVICES #49346

Merged
merged 3 commits into ray-project:master from hpguo/AMD_GPU_devices on Dec 19, 2024

Conversation

@hongpeng-guo (Contributor) commented on Dec 19, 2024:

Why are these changes needed?

This PR enables sharing ROCR_VISIBLE_DEVICES across workers when using AMD GPUs. This way, each worker can see and communicate with the other GPU devices on the node.
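
For intuition, here is a minimal sketch of the idea, analogous to how Ray Train already shares CUDA_VISIBLE_DEVICES for NVIDIA GPUs. This is illustrative only, not the actual Ray Train code; the helper name and data layout are made up for illustration.

import os


def share_rocr_visible_devices(worker_gpu_ids_on_node):
    """Illustrative only. worker_gpu_ids_on_node holds one list of GPU IDs per
    worker, e.g. [[0], [1], [2], [3]] for four workers with one GPU each."""
    all_ids = sorted({gpu_id for ids in worker_gpu_ids_on_node for gpu_id in ids})
    # Each worker process would apply this before initializing its ROCm context,
    # so every co-located worker can address all of the node's assigned GPUs.
    os.environ["ROCR_VISIBLE_DEVICES"] = ",".join(str(i) for i in all_ids)


share_rocr_visible_devices([[0], [1], [2], [3]])
print(os.environ["ROCR_VISIBLE_DEVICES"])  # "0,1,2,3"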

Related issue number

#49260

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
@hongpeng-guo (Contributor, Author) commented:

@AVSuni @amorinConnor Feel free to take a look and review this PR.

@pcmoritz changed the title from "[Train] Add env vars to enable Share AMD ROCM_VIDIABLE_DEVICES" to "[Train] Add env vars to enable Share AMD ROCM_VISIBLE_DEVICES" on Dec 19, 2024
@amorinConnor commented:

@hongpeng-guo I believe AMD uses ROCR* in its environment variables, not ROCM* as you have it:

https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html

I will run some tests to see if this fixes the issue today.

@amorinConnor commented:

Just as a follow-up, there are already some spots inside Ray where ROCR* is utilized:

python/ray/_private/accelerators/amd_gpu.py, for example.
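
For context, ROCR_VISIBLE_DEVICES controls which AMD GPUs the ROCm runtime exposes to a process (see the GPU-isolation doc linked above). A quick illustration, assuming a ROCm build of PyTorch and arbitrary device indices:

import os

# Must be set before any HIP/ROCm-using library initializes its devices.
os.environ["ROCR_VISIBLE_DEVICES"] = "0,2"

import torch  # noqa: E402  (imported after setting the env var on purpose)

# On a ROCm build of PyTorch, AMD GPUs are exposed through the torch.cuda API,
# so only the two visible devices are reported here.
print(torch.cuda.device_count())  # expected: 2 on a machine with 3+ AMD GPUs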

@amorinConnor commented:

@hongpeng-guo After modifying your code to use ROCR*, it looks like this fixes the issue. While I'm not able to run the original code (I think due to another problem on my end), the following example runs without error and rocm-smi shows all 4 GPUs utilized:



import os
import tempfile

import torch
from torch import nn
from torch.nn.parallel import DistributedDataParallel

import ray
from ray.train import Checkpoint, CheckpointConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

# If using GPUs, set this to True.
use_gpu = True
# Number of processes to run training on.
num_workers = 4
# del os.environ['OMP_PLACES']
# del os.environ['OMP_PROC_BIND']
# Define your network structure.
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(1, 32)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(32, 1)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))

# Training loop.
def train_loop_per_worker(config):

    # Read configurations.
    lr = config["lr"]
    batch_size = config["batch_size"]
    num_epochs = config["num_epochs"]

    # Fetch training dataset.
    train_dataset_shard = ray.train.get_dataset_shard("train")

    # Instantiate and prepare model for training.
    model = NeuralNetwork()
    model = ray.train.torch.prepare_model(model)
    print("Pass")
    # Define loss and optimizer.
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    # Create data loader.
    dataloader = train_dataset_shard.iter_torch_batches(
        batch_size=batch_size, dtypes=torch.float
    )

    # Train multiple epochs.
    for epoch in range(num_epochs):

        # Train epoch.
        for batch in dataloader:
            output = model(batch["input"])
            loss = loss_fn(output, batch["label"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Create checkpoint.
        base_model = (model.module
            if isinstance(model, DistributedDataParallel) else model)
        checkpoint_dir = tempfile.mkdtemp()
        torch.save(
            {"model_state_dict": base_model.state_dict()},
            os.path.join(checkpoint_dir, "model.pt"),
        )
        checkpoint = Checkpoint.from_directory(checkpoint_dir)

        # Report metrics and checkpoint.
        ray.train.report({"loss": loss.item()}, checkpoint=checkpoint)


# Define configurations.
train_loop_config = {"num_epochs": 50, "lr": 0.01, "batch_size": 32}
scaling_config = ScalingConfig(num_workers=num_workers, use_gpu=use_gpu)
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=1))

# Define datasets.
train_dataset = ray.data.from_items(
    [{"input": [x], "label": [2 * x + 1]} for x in range(2000)]
)
datasets = {"train": train_dataset}

# Initialize the Trainer.
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config=train_loop_config,
    scaling_config=scaling_config,
    run_config=run_config,
    datasets=datasets
)

# Train the model.
result = trainer.fit()

# Inspect the results.
final_loss = result.metrics["loss"]

@hongpeng-guo (Contributor, Author) commented:

Thank you so much for testing it out! Let me update this PR and try to get it merged soon.

@hongpeng-guo (Contributor, Author) commented:

Got it! Thank you so much for digging deep into it. The ROCR* usage you pointed out in python/ray/_private/accelerators/amd_gpu.py is part of the Ray Core-level accelerator setup. In Ray Train, our abstraction is a bit different, but I think in the long run we can maybe reuse the Ray Core accelerator utilities. cc @matthewdeng
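
As a rough illustration of that longer-term direction (purely hypothetical; the table and helper below are not Ray Core's actual API), the env var name could be looked up per accelerator type instead of being hardcoded in Ray Train:

import os

# Hypothetical mapping; Ray Core's accelerator managers encapsulate this differently.
VISIBLE_DEVICES_ENV_VAR = {
    "nvidia": "CUDA_VISIBLE_DEVICES",
    "amd": "ROCR_VISIBLE_DEVICES",
}


def share_visible_devices(accelerator: str, device_ids) -> None:
    # Expose the given devices to the current process for the given accelerator type.
    env_var = VISIBLE_DEVICES_ENV_VAR[accelerator]
    os.environ[env_var] = ",".join(str(i) for i in device_ids)


share_visible_devices("amd", [0, 1, 2, 3])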

Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
@hongpeng-guo (Contributor, Author) left a comment:

Update: Fixed the env var naming from ROCM to ROCR. Confirmed it's working on AMD devices, according to @amorinConnor.

@matthewdeng PTAL.

@matthewdeng changed the title from "[Train] Add env vars to enable Share AMD ROCM_VISIBLE_DEVICES" to "[Train] Add env vars to enable Share AMD ROCR_VISIBLE_DEVICES" on Dec 19, 2024
@matthewdeng (Contributor) left a comment:

nice

@matthewdeng enabled auto-merge (squash) on December 19, 2024 21:50
@github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) on Dec 19, 2024
@matthewdeng merged commit 202d0dc into ray-project:master on Dec 19, 2024
6 of 7 checks passed
@hongpeng-guo deleted the hpguo/AMD_GPU_devices branch on December 20, 2024 01:59