[QST] Error on invalid device ordinal and cuda_memory_resource #4806
Comments
Thanks for filing this! I think this might be a better fit for the cuGraph repository. The RMM failure you're observing is probably due to an earlier bug occurring in cuGraph's code. I will transfer this issue there. cc: @ChuckHastings for awareness.
Thank you for the response! I think you are right! I compiled I am open to ideas for debugging or solving this. Thanks
OK, I have not solved the problem, but I found a workaround. The problem looks like undefined behavior in some part of the code. I hit it when I build with debug symbols.
We have not been able to produce a complete build of cugraph with debug symbols in a while (the overall code is too big). Can you share some simple code that reproduces your error? I can try to reproduce it myself, which would make it easier to diagnose the problem you are seeing.
First of all, thank you for your response! The Louvain tests are failing (Rmat32, Rmat64). I assume you can run those tests. I built with:
From my observations, small datasets like karate work perfectly fine. Just FYI. I hope it helps. Thanks
I'm back from holiday break and will start investigating. I'll let you know how I progress.
I have reproduced and isolated the issue. I've kicked this over to @seunghwak to investigate further. I believe the issue is that when we're built in debug mode some of the kernels use more resources, so we need to adjust the parallelism we ask for on the GPU. But Seunghwa is our expert on this, I'll let him correct this and provide the final explanation. We haven't been able to build our code in debug mode for a while... clearly some of the refactoring that we have done has made it possible to do again. Now that we can I imagine we will do more compilation in this mode and discover any other hidden issues like this.
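To illustrate the resource argument above, here is a minimal host-only C++ sketch of the arithmetic involved. It is not cuGraph's actual fix. The `DeviceLimits` struct and both helper functions are hypothetical names introduced for illustration, and the default limits are the per-block figures the CUDA documentation lists for compute capability 8.6. The model is simplified: it only considers registers and ignores allocation granularity and shared memory.

```cpp
// Hypothetical per-device limits. The defaults are assumptions taken from
// the documented figures for compute capability 8.6 (e.g. an RTX 3090).
struct DeviceLimits {
  int max_threads_per_block   = 1024;
  int max_registers_per_block = 65536;
  int warp_size               = 32;
};

// Largest block size (rounded down to a multiple of the warp size) whose
// total register demand fits in one block's register budget.
// Returns 0 for a non-positive register count or if no full warp fits.
constexpr int max_block_size_for(int registers_per_thread,
                                 DeviceLimits const& limits = DeviceLimits{}) {
  if (registers_per_thread <= 0) { return 0; }
  int by_registers = limits.max_registers_per_block / registers_per_thread;
  int capped = by_registers < limits.max_threads_per_block
                 ? by_registers
                 : limits.max_threads_per_block;
  return (capped / limits.warp_size) * limits.warp_size;
}

// True if a launch with `block_size` threads stays within the register
// budget under this simplified model.
constexpr bool launch_fits(int block_size, int registers_per_thread,
                           DeviceLimits const& limits = DeviceLimits{}) {
  return block_size > 0 &&
         block_size <= max_block_size_for(registers_per_thread, limits);
}
```

Under these assumptions, a kernel that needs 32 registers per thread can launch with 1024 threads per block, but the same kernel compiled without optimization and needing 128 registers per thread only fits with 512. In real CUDA code the per-kernel register count can be read with `cudaFuncGetAttributes`, and `cudaOccupancyMaxPotentialBlockSize` can suggest a block size that fits.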
Hello!
I am not sure this is the correct place to ask this question.
I am getting an error like this:
We are currently developing a project on top of cugraph, specifically Louvain. On my colleague's PC, the examples and tests that I am going to mention work perfectly fine. We installed from the same repo/commit, with the same CUDA version and the same OS.
The examples that I tried are:
https://github.com/yigithanyigit/cugraph/blob/branch-24.12/cpp/tests/community/louvain_test.cpp
https://github.com/yigithanyigit/cugraph-template/blob/main/src/louvain.cu
Short description of the problem:
If I work with a small dataset like karate, there is no problem. It starts and finishes successfully. But when I work with big datasets (ca-hollywood-2009, soc-livejournal), it initializes, runs for ~30-40 seconds, and then crashes (probably at the de-allocation stage).
I also ran it with compute-sanitizer and got these results:
CUDA version: 12.4
Compute capability: 86
Device: RTX 3090
OS: Ubuntu 22.04 LTS
Driver: 550.127.08
librmm : 24.12.00a33 cuda12_241204_g3b5f6af2_33 rapidsai-nightly
rmm: 24.12.00a33 cuda12_py312_241204_g3b5f6af2_33 rapidsai-nightly
Update
The issue also occurs on 24.10. I also reinstalled the OS and tried different drivers:
CUDA version: 12.6
Compute capability: 86
Device: RTX 3090
OS: Ubuntu 24.04 LTS
Driver: 560.35.03