[Train] Add env vars to enable Share AMD ROCR_VISIBLE_DEVICES
#49346
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
@AVSuni @amorinConnor Feel free to take a look and review this PR.
ROCM_VIDIABLE_DEVICES
ROCM_VISIBLE_DEVICES
@hongpeng-guo I believe AMD uses ROCR* in its environment variables, not ROCM* as you have it: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html I will run some tests today to see if this fixes the issue.
As a follow-up, there are already some spots inside Ray where ROCR* is used, for example [python/ray/_private/accelerators/amd_gpu.py].
@hongpeng-guo After modifying your code to use ROCR*, it looks like this fixes the issue. While I'm not able to run the original code (I think due to another problem on my end), the following example runs without error and rocm-smi shows all 4 GPUs utilized:
Thank you so much for testing it out! Let me update this PR and try to get it merged soon.
Got it! Thank you so much for digging deep into it. The code above is from the Ray Core-level accelerator setup. In Ray Train, our abstraction is a bit different, but I think in the long run we may be able to reuse the Ray Core accelerator utilities. cc @matthewdeng
Signed-off-by: Hongpeng Guo <hpguo@anyscale.com>
Update: Fixed the env var naming from ROCM to ROCR. Confirmed it's working on AMD devices, according to @amorinConnor.
@matthewdeng PTAL.
ROCM_VISIBLE_DEVICES
ROCR_VISIBLE_DEVICES
nice
Why are these changes needed?
This PR enables sharing ROCR_VISIBLE_DEVICES when using AMD GPUs. In this way, the devices can see and communicate with other GPU devices.
Related issue number
#49260
Checks
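I've signed off every commit (git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.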
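For readers unfamiliar with what "sharing" ROCR_VISIBLE_DEVICES means in the description above, here is a minimal sketch of the general idea. It is not the code from this PR: the function name `shared_visible_devices` and the `(node_id, gpu_ids)` input shape are hypothetical, chosen only for illustration. It assumes each worker starts with its own assigned GPU IDs and should end up seeing the union of IDs held by all workers on the same node.

```python
from collections import defaultdict


def shared_visible_devices(worker_gpus):
    """Return, per worker, the comma-joined union of GPU IDs on its node.

    worker_gpus: list of (node_id, [gpu_id, ...]) tuples, one per worker.
    This is a hypothetical helper, not part of Ray's API.
    """
    # Collect every GPU ID assigned to any worker on each node.
    ids_by_node = defaultdict(set)
    for node_id, gpu_ids in worker_gpus:
        ids_by_node[node_id].update(str(g) for g in gpu_ids)

    def sort_key(gpu_id):
        # Numeric IDs sort numerically; other IDs (e.g. UUIDs) sort after them.
        return (0, int(gpu_id), "") if gpu_id.isdigit() else (1, 0, gpu_id)

    # Every worker on a node receives the same, deterministically ordered list.
    return [
        ",".join(sorted(ids_by_node[node_id], key=sort_key))
        for node_id, _ in worker_gpus
    ]


workers = [("node-a", [0]), ("node-a", [1]), ("node-b", [0]), ("node-b", [1])]
shared = shared_visible_devices(workers)
# Each worker would then export its own entry, for example:
# os.environ["ROCR_VISIBLE_DEVICES"] = shared[rank]
```

With each worker exporting the node-wide list rather than only its own device, processes on the same node can address one another's GPUs, which is the behavior the description refers to.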