
[QST] Error on invalid device ordinal and cuda_memory_resource #4806

Open
yigithanyigit opened this issue Dec 5, 2024 · 7 comments · May be fixed by #4872
Labels: question (Further information is requested)


yigithanyigit commented Dec 5, 2024

Hello!

I am not sure this is the correct place to ask this question.

I am getting an error like this:

Thrust exception: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal
CUDA Error detected. cudaErrorInvalidValue invalid argument
louvain_example: /home/yigithan/miniconda3/envs/cugraph_dev/include/rmm/mr/device/cuda_memory_resource.hpp:80: virtual void rmm::mr::cuda_memory_resource::do_deallocate(void*, std::size_t, rmm::cuda_stream_view): Assertion `status__ == cudaSuccess' failed.

We are currently developing a project on top of cugraph, specifically Louvain. On my colleague's PC, the examples and tests that I am going to mention work perfectly fine. We installed from the same repo/commit, with the same CUDA version and the same OS.

The examples that I tried are:

https://github.com/yigithanyigit/cugraph/blob/branch-24.12/cpp/tests/community/louvain_test.cpp

https://github.com/yigithanyigit/cugraph-template/blob/main/src/louvain.cu

A short description of the problem:

If I work with a small dataset like karate, there is no problem: it starts and finishes successfully. But when I work with big datasets (ca-hollywood-2009, soc-livejournal), it initializes, runs for ~30-40 seconds, and then crashes (probably at the deallocation stage).

I also ran with compute-sanitizer and got these results:

Program hit cudaErrorLaunchOutOfResources (error 701) due to "too many resources requested for launch" on CUDA API call to cudaLaunchKernel_ptsz.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4466f5]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaLaunchKernel_ptsz [0x547fd]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel in /home/yigithan/miniconda3/envs/cugraph_dev/targets/x86_64-linux/include/cuda_runtime_api.h:14030 [0xe5ecef1]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcugraph.so
=========     Host Frame:_ZL736__device_stub__ZN7cugraph6detail35per_v_transform_reduce_e_mid_degreeILb1ENS_12graph_view_tIiiLb0ELb0EvEENS0_52edge_partition_endpoint_dummy_property_device_view_tIiEES5_NS0_42edge_partition_edge_property_device_view_tIiPKffEENS6_IiPKjbEEPfZNS_71_GLOBAL__N__3530e449_32_graph_weight_utils_sg_v32_e32_cu_4d8abc56_2573119compute_weight_sumsILb1EiifLb0ELb0EEEN3rmm14device_uvectorIT2_EERKN4raft8handle_tERKNS2_IT0_T1_XT3_EXT4_EvEENS_20edge_property_view_tISP_PKSI_N6thrust15iterator_traitsISV_E10value_typeEEEEUnvdl0_PFNSH_IfEESN_RKS3_NST_IiS8_fEEESF_ILb1EiifLb0ELb0EE2_NS_9reduce_op4plusIfEEfEEvNS_28edge_partition_device_view_tINSO_11vertex_typeENSO_9edge_typeEXsrSO_12is_multi_gpuEvEES1C_S1C_SP_SI_T3_NSW_8optionalIT4_EET5_T6_T8_S1L_T7_RN7cugraph28edge_partition_device_view_tIiiLb0EvEEiiRNS_6detail52edge_partition_endpoint_dummy_property_device_view_tIiEES6_RNS3_42edge_partition_edge_property_device_view_tIiPKffEERN6thrust8optionalINS7_IiPKjbEEEEPfR17__nv_dl_wrapper_tI11__nv_dl_tagIPFN3rmm14device_uvectorIfEERKN4raft8handle_tERKNS_12graph_view_tIiiLb0ELb0EvEENS_20edge_property_view_tIiS9_fEEEXadL_ZNS_71_GLOBAL__N__3530e449_32_graph_weight_utils_sg_v32_e32_cu_4d8abc56_2573119compute_weight_sumsILb1EiifLb0ELb0EEENSN_IT2_EESS_RKNST_IT0_T1_XT3_EXT4_EvEENSX_IS16_PKS13_NSC_15iterator_traitsIS1B_E10value_typeEEEEELj2EEJEEffRNS_9reduce_op4plusIfEE in /tmp/tmpxft_00006475_00000000-6_graph_weight_utils_sg_v32_e32.cudafe1.stub.c:233 [0xe5ee3f6]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcugraph.so
...

My CUDA version: 12.4
Compute capability: 8.6
Device: RTX 3090
OS: Ubuntu 22.04 LTS
Driver: 550.127.08

librmm: 24.12.00a33 cuda12_241204_g3b5f6af2_33 rapidsai-nightly
rmm: 24.12.00a33 cuda12_py312_241204_g3b5f6af2_33 rapidsai-nightly

Update

The issue also occurs on 24.10.

I also reinstalled the OS and tried different drivers:

My CUDA version: 12.6
Compute capability: 8.6
Device: RTX 3090
OS: Ubuntu 24.04 LTS
Driver: 560.35.03
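
For context on the compute-sanitizer report above: cudaErrorLaunchOutOfResources usually means the launch configuration asks for more registers (or shared memory) per block than the kernel can actually get, and debug builds (-G) typically inflate per-thread register usage. A minimal, hypothetical sketch of how to inspect a kernel's resource footprint (dummy_kernel is just a stand-in, not the cuGraph kernel from the backtrace):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the cuGraph kernel named in the backtrace.
__global__ void dummy_kernel(float* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = static_cast<float>(i); }
}

int main()
{
  cudaFuncAttributes attr{};
  cudaError_t status = cudaFuncGetAttributes(&attr, dummy_kernel);
  if (status != cudaSuccess) {
    std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(status));
    return 1;
  }
  // A -G (debug) build typically reports noticeably more registers per thread
  // than a release build, which shrinks the largest block size that can launch.
  std::printf("regs/thread = %d, static shared mem = %zu bytes, max threads/block = %d\n",
              attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);
  return 0;
}
```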

yigithanyigit added the "? - Needs Triage" and "question" labels on Dec 5, 2024

bdice commented Dec 5, 2024

Thanks for filing this! I think this might be a better fit for the cuGraph repository. The RMM failure you're observing is probably due to an earlier bug occurring in cuGraph's code. I will transfer this issue there.

cc: @ChuckHastings for awareness.
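
As background on why the failure surfaces inside RMM: once a kernel hits a sticky error, every later CUDA runtime call in the process reports a failure, including the cudaFree inside rmm::mr::cuda_memory_resource::do_deallocate, which is what trips the assertion. A small, hypothetical sketch of that behavior (the exact error codes differ from the ones in the report above):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately dereference a null device pointer to trigger a sticky error.
__global__ void bad_write(int* p) { *p = 42; }

int main()
{
  int* buf = nullptr;
  cudaMalloc(&buf, sizeof(int));

  bad_write<<<1, 1>>>(nullptr);                    // illegal memory access
  cudaError_t sync_err = cudaDeviceSynchronize();  // the failure surfaces here
  std::printf("sync: %s\n", cudaGetErrorString(sync_err));

  // The CUDA context is now corrupted, so even an unrelated deallocation
  // fails, just like the cudaFree inside RMM's do_deallocate in the report.
  cudaError_t free_err = cudaFree(buf);
  std::printf("cudaFree: %s\n", cudaGetErrorString(free_err));
  return 0;
}
```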

bdice transferred this issue from rapidsai/rmm on Dec 5, 2024

yigithanyigit commented Dec 10, 2024

Thank you for the response!

I think you are right! I compiled librmm from source and it works fine. I tried to trace the problem down further, but I couldn't achieve much in the process.

I am open to ideas for debugging/solving this.

Thanks

yigithanyigit commented

OK, I have not solved the problem, but I found a workaround.

The problem looks like undefined behavior in some part of the code. I only hit this problem if I build with debug symbols.

ChuckHastings commented

We have not been able to do a complete build of cugraph with debug symbols in a while (the overall code is too big). Can you share some simple code that reproduces your error? I can try to reproduce it myself, and that would make it easier to diagnose the problem you are seeing.

yigithanyigit commented

> We have not been able to do a complete build of cugraph with debug symbols in a while (the overall code is too big). Can you share some simple code that reproduces your error? I can try to reproduce it myself, and that would make it easier to diagnose the problem you are seeing.

First of all, thank you for your response!

The Louvain tests are failing (Rmat32, Rmat64). I assume you can run those tests.

I built with:

./build.sh libcugraph -g

From my observations, small datasets like karate work perfectly fine. Just FYI.

I hope it helps.

Thanks

ChuckHastings commented

I'm back from holiday break and will start investigating. I'll let you know how I progress.

ChuckHastings self-assigned this on Jan 6, 2025
ChuckHastings removed the "? - Needs Triage" label on Jan 6, 2025
ChuckHastings linked a pull request on Jan 17, 2025 that will close this issue
ChuckHastings commented

I have reproduced and isolated the issue. I've kicked this over to @seunghwak to investigate further. I believe the issue is that when we build in debug mode, some of the kernels use more resources, so we need to adjust the parallelism we ask for on the GPU. But Seunghwa is our expert on this, so I'll let him correct this and provide the final explanation.
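
This is not cuGraph's actual fix (that is what the linked PR will cover), but as a generic illustration of "adjusting the parallelism": a launch configuration can be derived from the kernel's measured resource footprint via the occupancy API, so a debug build that uses more registers automatically gets a smaller block size instead of failing with cudaErrorLaunchOutOfResources. heavy_kernel below is a hypothetical stand-in:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a register-heavy cuGraph kernel.
__global__ void heavy_kernel(float* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = out[i] * 2.0f + 1.0f; }
}

int main()
{
  int n       = 1 << 20;
  float* data = nullptr;
  cudaMalloc(&data, n * sizeof(float));

  // Let the runtime pick a block size that fits this kernel's register and
  // shared-memory usage; in a -G build the chosen block size shrinks rather
  // than the launch failing outright.
  int min_grid_size = 0;
  int block_size    = 0;
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, heavy_kernel, 0, 0);

  int grid_size = (n + block_size - 1) / block_size;
  heavy_kernel<<<grid_size, block_size>>>(data, n);
  cudaError_t err = cudaDeviceSynchronize();
  std::printf("block_size = %d, status = %s\n", block_size, cudaGetErrorString(err));

  cudaFree(data);
  return 0;
}
```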

We haven't been able to build our code in debug mode for a while... clearly some of the refactoring that we have done has made it possible again. Now that we can, I imagine we will do more compilation in this mode and discover any other hidden issues like this.
