feature request: runTheMatrix.py should assign a different GPU to each job #47337
A new Issue was created by @fwyzard. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here

assign Configuration/PyReleaseValidation

assign heterogeneous

New categories assigned: pdmv,upgrade,heterogeneous @AdrianoDee, @DickyChant, @fwyzard, @makortel, @miquork, @Moanwar, @srimanob, @subirsarkar you have been requested to review this Pull request/Issue and eventually sign? Thanks
Apart from running something like

No,

And a follow-up question: how do we count the GPUs available on a node? I imagine we should do this outside any
I would think of building on top of our cudaComputeCapabilities:
cmssw/HeterogeneousCore/CUDAUtilities/bin/cudaComputeCapabilities.cpp, lines 25 to 27 in 69407d5
(similarly for
We can reuse the same mechanism that SCRAM uses to detect the GPUs:

```shell
if cudaIsEnabled; then
  # there is at least one supported NVIDIA GPU
else
  # there are no supported NVIDIA GPUs
fi
```

and

```shell
if rocmIsEnabled; then
  # there is at least one supported AMD GPU
else
  # there are no supported AMD GPUs
fi
```

Then you can enumerate the GPUs that are available and supported with

```shell
cudaComputeCapabilities | grep -v unsupported
```

and

```shell
rocmComputeCapabilities | grep -v unsupported
```

To select what NVIDIA GPU to use (e.g. GPU 0):

```shell
CUDA_VISIBLE_DEVICES=0 cmsRun ...
```

and for AMD GPUs (e.g. GPU 0):

```shell
CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 cmsRun ...
```
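The `grep -v unsupported` enumeration above could also be done from Python (runTheMatrix.py is a Python script). A minimal sketch, assuming the index-first output format that `cudaComputeCapabilities`/`rocmComputeCapabilities` print in the transcript later in this thread; `parse_supported_gpus` is a hypothetical helper, not an existing CMSSW API:

```python
def parse_supported_gpus(output: str) -> list[int]:
    """Return the device indices of supported GPUs.

    Assumes one device per line starting with the device index,
    with the word "unsupported" on lines for unsupported devices.
    """
    indices = []
    for line in output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip blank lines and error messages
        if "unsupported" in line:
            continue  # mirror `grep -v unsupported`
        indices.append(int(fields[0]))
    return indices

# Example input mimicking the transcript output format below
sample = "0   8.9   NVIDIA L40S\n1   8.9   NVIDIA L4"
print(parse_supported_gpus(sample))  # → [0, 1]
```

An error message such as `cudaComputeCapabilities: no CUDA-capable device is detected` yields an empty list, which doubles as the "no supported GPUs" check.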
They are almost equivalent: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html .
On a machine that has only AMD GPUs, any of them works. On a machine that has both NVIDIA and AMD GPUs (rare in practice, but I use one for testing) you need to set both.
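Putting the variable semantics above into code: a hypothetical Python helper (not part of CMSSW) that builds the per-job environment overrides, assuming the CUDA/HIP behaviour described in this thread — for AMD GPUs, `CUDA_VISIBLE_DEVICES` is cleared so the CUDA runtime does not also grab a device on mixed machines:

```python
def gpu_env(backend: str, device: int) -> dict[str, str]:
    """Environment overrides to pin a job to a single GPU.

    backend: "cuda" for NVIDIA GPUs, "rocm" for AMD GPUs.
    device:  index of the GPU to expose to the job.
    """
    if backend == "cuda":
        return {"CUDA_VISIBLE_DEVICES": str(device)}
    if backend == "rocm":
        # Clear CUDA_VISIBLE_DEVICES and set HIP_VISIBLE_DEVICES,
        # matching the `CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0`
        # invocation shown above.
        return {"CUDA_VISIBLE_DEVICES": "", "HIP_VISIBLE_DEVICES": str(device)}
    raise ValueError(f"unknown backend: {backend}")

print(gpu_env("rocm", 0))  # → {'CUDA_VISIBLE_DEVICES': '', 'HIP_VISIBLE_DEVICES': '0'}
```

These dictionaries could then be merged into `os.environ` when spawning each job, e.g. via the `env` argument of `subprocess.Popen`.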
Note that:

```shell
fwyzard@gputest-genoa-02.cms:~$ cudaComputeCapabilities
  0     8.9    NVIDIA L40S
  1     8.9    NVIDIA L4
fwyzard@gputest-genoa-02.cms:~$ CUDA_VISIBLE_DEVICES= cudaComputeCapabilities
cudaComputeCapabilities: no CUDA-capable device is detected
fwyzard@gputest-genoa-02.cms:~$ rocmComputeCapabilities
  0     gfx1100    AMD Radeon PRO W7800
fwyzard@gputest-genoa-02.cms:~$ CUDA_VISIBLE_DEVICES= rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected
fwyzard@gputest-genoa-02.cms:~$ CUDA_VISIBLE_DEVICES= HIP_VISIBLE_DEVICES=0 rocmComputeCapabilities
  0     gfx1100    AMD Radeon PRO W7800
fwyzard@gputest-genoa-02.cms:~$ CUDA_VISIBLE_DEVICES= ROCR_VISIBLE_DEVICES=0 rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected
fwyzard@gputest-genoa-02.cms:~$ CUDA_VISIBLE_DEVICES= GPU_DEVICE_ORDINAL=0 rocmComputeCapabilities
rocmComputeCapabilities: no ROCm-capable device is detected
```
A proposed solution in #47377
runTheMatrix.py creates and executes jobs without any kind of GPU assignment. On a machine with a single GPU, this is not an issue.
On a machine with more than one GPU, for example
or
the result is that all jobs try to use all GPUs, which is quite inefficient.
A better approach would be to assign a different GPU to each job, for example in a round-robin fashion.
If there are more concurrent jobs than GPUs, the GPUs will be shared - but to a much lesser extent than now.
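The proposed round-robin scheme can be sketched in a few lines of Python (a hypothetical function for illustration, not the actual runTheMatrix.py implementation):

```python
import itertools

def assign_gpus(n_jobs: int, gpu_ids: list[int]) -> list[int]:
    """Assign one GPU id to each job in round-robin order.

    If there are more jobs than GPUs, the GPUs are shared, but each
    job still uses a single device instead of trying to use all of them.
    """
    cycle = itertools.cycle(gpu_ids)
    return [next(cycle) for _ in range(n_jobs)]

# 5 concurrent jobs on a 2-GPU machine: devices are reused,
# but no job ever sees more than one GPU
print(assign_gpus(5, [0, 1]))  # → [0, 1, 0, 1, 0]
```

Each assigned id would then be exported to the corresponding job, e.g. as `CUDA_VISIBLE_DEVICES` or `HIP_VISIBLE_DEVICES` as discussed above.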