Add ROCm as alternative to CUDA for plugin use #461

Open
wants to merge 2 commits into
base: master
from

Conversation

ryanhankins
Contributor

Description of changes:

See the commit messages for more detail. Add a --with-rocm flag to configure.ac that switches between CUDA and ROCm GPU calls, in order to support AMD GPUs. Add code that abstracts the CUDA calls so that, when the --with-rocm option is used, the ROCm alternatives are called instead.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ryanhankins ryanhankins changed the title Merge6 Add ROCm as alternative to CUDA for plugin use. Jun 27, 2024
@ryanhankins ryanhankins changed the title Add ROCm as alternative to CUDA for plugin use. Add ROCm as alternative to CUDA for plugin use Jun 27, 2024
@ryanhankins ryanhankins force-pushed the merge6 branch 8 times, most recently from 98282b4 to 064fb2c Compare June 27, 2024 18:32
@ryanhankins ryanhankins marked this pull request as ready for review June 28, 2024 11:19
@ryanhankins ryanhankins requested review from bwbarrett and a team as code owners June 28, 2024 11:19
@liralon
Contributor

liralon commented Jun 28, 2024

@ryanhankins Can you please add some information to the commit message about which platforms you have tested this functionality on?

The nccl_net_ofi_cu* calls map directly to CUDA methods.  Instead of this
mapping, insert indirection via nccl_net_ofi_gpu methods so that the
implementation of the methods depends on CUDA, but the methods
themselves can be called for different underlying frameworks (such as
ROCm).

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
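
The indirection described in this commit message can be pictured with a minimal sketch. Every `example_*` and `stub_*` name below is hypothetical and is not taken from the plugin source; the vendor runtime is replaced by stubs with made-up values so the sketch builds without a GPU SDK. The point is only that callers see framework-neutral wrappers, while the wrapper bodies are the single place that would touch CUDA (or, later, ROCm).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the vendor runtime so the sketch builds without a GPU SDK.
 * In a CUDA backend these would be calls such as cudaGetDeviceCount() and
 * cudaGetDevice(); the fixed values returned here are made up. */
static int stub_runtime_device_count(int *count) { *count = 2; return 0; }
static int stub_runtime_get_device(int *dev) { *dev = 0; return 0; }

/* Framework-neutral wrappers (hypothetical names). Only these bodies
 * know which GPU framework sits underneath. */
int example_gpu_device_count(int *count)
{
    if (count == NULL)
        return -1;
    return stub_runtime_device_count(count) == 0 ? 0 : -1;
}

int example_gpu_current_device(int *dev)
{
    if (dev == NULL)
        return -1;
    return stub_runtime_get_device(dev) == 0 ? 0 : -1;
}

/* Caller code uses only the neutral wrappers and never names a framework.
 * Returns 1 if the current device index is within the reported count. */
int example_current_device_is_valid(void)
{
    int count = 0;
    int dev = -1;

    if (example_gpu_device_count(&count) != 0)
        return 0;
    if (example_gpu_current_device(&dev) != 0)
        return 0;
    assert(count >= 0);
    return dev >= 0 && dev < count;
}
```

With this shape, supporting another framework would mean supplying different wrapper bodies, leaving callers such as `example_current_device_is_valid()` unchanged.
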

ROCm provides an interface similar to CUDA for working with AMD GPUs.
Provide a compile time option to build with ROCm instead of CUDA.

1. Add --with-rocm= flag to ./configure.
2. Make all CUDA calls "gpu" calls, which are independent of the
   underlying framework.
3. Switch between _rocm and _cuda files at compile time to make the
   appropriate calls.
4. When building for RCCL (AMD's counterpart to NCCL), name the
   generated plugin rccl-net.so for binary compatibility.

Tested on:

1. HPE Cray EX with EX235A BardPeak GPUs + 200 Gb Slingshot adapters.
2. HPE Cray EX with NVIDIA A100 SXM4 80GB GPUs + 200 Gb Slingshot
   adapters.

Signed-off-by: Ryan Hankins <ryan.hankins@hpe.com>
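
As a rough illustration of the compile-time switch in steps 1 to 4 of this commit message, the sketch below selects a backend with a preprocessor macro. `EXAMPLE_HAVE_ROCM` is a hypothetical macro standing in for whatever `--with-rocm` actually defines, all `example_*` names are invented for illustration, and the CUDA-side file name `nccl-net.so` is an assumption made by analogy with the `rccl-net.so` name mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* EXAMPLE_HAVE_ROCM is a hypothetical macro; the real name defined by
 * the --with-rocm configure option is not shown in this PR text. */
#ifdef EXAMPLE_HAVE_ROCM
/* A ROCm build would compile the _rocm implementation and call hip* APIs. */
const char *example_backend_name(void) { return "rocm"; }
const char *example_plugin_basename(void) { return "rccl-net"; }
#else
/* A CUDA build would compile the _cuda implementation and call cuda* APIs.
 * The "nccl-net" base name is an assumption made by analogy. */
const char *example_backend_name(void) { return "cuda"; }
const char *example_plugin_basename(void) { return "nccl-net"; }
#endif

/* Write the plugin file name for the selected backend into buf.
 * Returns 0 on success, -1 if buf is NULL or too small. */
int example_plugin_filename(char *buf, size_t len)
{
    int n;

    if (buf == NULL || len == 0)
        return -1;
    n = snprintf(buf, len, "%s.so", example_plugin_basename());
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

Compiling the same file with `-DEXAMPLE_HAVE_ROCM` would flip both the backend and the generated name without touching any caller, which is the property the commit relies on.
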
2 participants