[BUG] Nightly CI issue: CUDA 11.4 jobs were running with CUDA 11.8 when nccl wasn't available #2402

Open
dantegd opened this issue Jul 30, 2024 · 0 comments
Labels
bug Something isn't working

Comments

dantegd commented Jul 30, 2024

The conda-forge NCCL 2.22.3.1 package was not available for CUDA < 11.8 until yesterday, so all of cuML's CUDA 11.4 CI jobs failed until today. RAFT's CUDA 11.4 CI kept passing regardless, which confused me for a while.

Checking those jobs showed that they were installing cuda-version 11.8 and the corresponding packages. For example, the following snippet from this CUDA 11.4 log shows the upgrade happening while the downloaded artifacts are installed:

  Upgrade:
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

  - cuda-version                              11.4  hfb901f2_3                       conda-forge             Cached
  + cuda-version                              11.8  h70ddcb2_3                       conda-forge               21kB
  - cudatoolkit                             11.4.3  h39f8164_13                      conda-forge             Cached
  + cudatoolkit                             11.8.0  h4ba93d1_13                      conda-forge              716MB

This should not be happening on CUDA 11.4 jobs, of course. I think nccl itself is no longer a problem, but any other package could cause the same situation. That could make things fail silently in the future and catch us by surprise, defeating the point of having CUDA 11.4 jobs in nightly CI.
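
One way to keep a solver upgrade like the one in the log from going unnoticed is a guard step that runs after the artifacts are installed and fails the job if the environment no longer matches the CUDA version the job claims to test. The following is only a minimal sketch of that idea, not what RAPIDS CI actually does. The helper names `cuda_version_mismatches` and `check_environment` are hypothetical, and the sketch assumes `conda list --json` prints a list of objects with `name` and `version` fields.

```python
import json
import subprocess


def cuda_version_mismatches(packages, expected,
                            names=("cuda-version", "cudatoolkit")):
    """Return (name, version) pairs whose major.minor differs from `expected`.

    `packages` is a list of dicts with "name" and "version" keys, which is
    the shape this sketch assumes `conda list --json` produces.
    """
    mismatches = []
    for pkg in packages:
        if pkg["name"] in names:
            # Compare only major.minor, so "11.4.3" matches an expected "11.4".
            major_minor = ".".join(pkg["version"].split(".")[:2])
            if major_minor != expected:
                mismatches.append((pkg["name"], pkg["version"]))
    return mismatches


def check_environment(expected):
    """Hypothetical CI guard: raise if the active env drifted from `expected`."""
    out = subprocess.run(
        ["conda", "list", "--json"],
        check=True, capture_output=True, text=True,
    ).stdout
    bad = cuda_version_mismatches(json.loads(out), expected)
    if bad:
        raise RuntimeError(f"expected CUDA {expected}, but found: {bad}")
```

A CUDA 11.4 job could call `check_environment("11.4")` right after the install step. With the packages from the log above, the check would flag both `cuda-version 11.8` and `cudatoolkit 11.8.0` instead of letting the job pass.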

dantegd added the bug label Jul 30, 2024