
[QST] Error on invalid device ordinal and cuda_memory_resource #4806

Open
yigithanyigit opened this issue Dec 5, 2024 · 7 comments · May be fixed by #4872
Labels: question (Further information is requested)


yigithanyigit commented Dec 5, 2024

Hello!

I am not sure this is the correct place to ask this question.

I am getting an error like this:

Thrust exception: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal
CUDA Error detected. cudaErrorInvalidValue invalid argument
louvain_example: /home/yigithan/miniconda3/envs/cugraph_dev/include/rmm/mr/device/cuda_memory_resource.hpp:80: virtual void rmm::mr::cuda_memory_resource::do_deallocate(void*, std::size_t, rmm::cuda_stream_view): Assertion `status__ == cudaSuccess' failed.

We are currently developing a project on top of cugraph, specifically Louvain. On my colleague's PC, the examples and tests that I am going to mention work perfectly fine. We installed from the same repo/commit, with the same CUDA version and the same OS.

The examples that I tried are:

https://github.com/yigithanyigit/cugraph/blob/branch-24.12/cpp/tests/community/louvain_test.cpp

https://github.com/yigithanyigit/cugraph-template/blob/main/src/louvain.cu

A short description of the problem:

If I work with a small dataset like karate, there is no problem: it starts and finishes successfully. But when I work with big datasets (ca-hollywood-2009, soc-livejournal), it initializes, runs for ~30-40 seconds, and then crashes (probably at the deallocation stage).

I also ran with compute-sanitizer and got these results:

Program hit cudaErrorLaunchOutOfResources (error 701) due to "too many resources requested for launch" on CUDA API call to cudaLaunchKernel_ptsz.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4466f5]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaLaunchKernel_ptsz [0x547fd]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel in /home/yigithan/miniconda3/envs/cugraph_dev/targets/x86_64-linux/include/cuda_runtime_api.h:14030 [0xe5ecef1]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcugraph.so
=========     Host Frame:_ZL736__device_stub__ZN7cugraph6detail35per_v_transform_reduce_e_mid_degreeILb1ENS_12graph_view_tIiiLb0ELb0EvEENS0_52edge_partition_endpoint_dummy_property_device_view_tIiEES5_NS0_42edge_partition_edge_property_device_view_tIiPKffEENS6_IiPKjbEEPfZNS_71_GLOBAL__N__3530e449_32_graph_weight_utils_sg_v32_e32_cu_4d8abc56_2573119compute_weight_sumsILb1EiifLb0ELb0EEEN3rmm14device_uvectorIT2_EERKN4raft8handle_tERKNS2_IT0_T1_XT3_EXT4_EvEENS_20edge_property_view_tISP_PKSI_N6thrust15iterator_traitsISV_E10value_typeEEEEUnvdl0_PFNSH_IfEESN_RKS3_NST_IiS8_fEEESF_ILb1EiifLb0ELb0EE2_NS_9reduce_op4plusIfEEfEEvNS_28edge_partition_device_view_tINSO_11vertex_typeENSO_9edge_typeEXsrSO_12is_multi_gpuEvEES1C_S1C_SP_SI_T3_NSW_8optionalIT4_EET5_T6_T8_S1L_T7_RN7cugraph28edge_partition_device_view_tIiiLb0EvEEiiRNS_6detail52edge_partition_endpoint_dummy_property_device_view_tIiEES6_RNS3_42edge_partition_edge_property_device_view_tIiPKffEERN6thrust8optionalINS7_IiPKjbEEEEPfR17__nv_dl_wrapper_tI11__nv_dl_tagIPFN3rmm14device_uvectorIfEERKN4raft8handle_tERKNS_12graph_view_tIiiLb0ELb0EvEENS_20edge_property_view_tIiS9_fEEEXadL_ZNS_71_GLOBAL__N__3530e449_32_graph_weight_utils_sg_v32_e32_cu_4d8abc56_2573119compute_weight_sumsILb1EiifLb0ELb0EEENSN_IT2_EESS_RKNST_IT0_T1_XT3_EXT4_EvEENSX_IS16_PKS13_NSC_15iterator_traitsIS1B_E10value_typeEEEEELj2EEJEEffRNS_9reduce_op4plusIfEE in /tmp/tmpxft_00006475_00000000-6_graph_weight_utils_sg_v32_e32.cudafe1.stub.c:233 [0xe5ee3f6]
=========                in /home/yigithan/miniconda3/envs/cugraph_dev/lib/libcugraph.so
...

My CUDA version: 12.4
Compute capability: 8.6
Device: RTX 3090
OS: Ubuntu 22.04 LTS
Driver: 550.127.08

librmm: 24.12.00a33 cuda12_241204_g3b5f6af2_33 rapidsai-nightly
rmm: 24.12.00a33 cuda12_py312_241204_g3b5f6af2_33 rapidsai-nightly

Update

The issue also occurs on 24.10.

I also reinstalled the OS and tried different drivers:

My CUDA version: 12.6
Compute capability: 8.6
Device: RTX 3090
OS: Ubuntu 24.04 LTS
Driver: 560.35.03
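
For context on the compute-sanitizer report above: cudaErrorLaunchOutOfResources usually means the launch configuration asks for more registers (or shared memory) per block than the kernel can actually get, and debug builds (-G) typically inflate per-thread register usage. A minimal, hypothetical sketch of how to inspect a kernel's resource footprint (dummy_kernel is just a stand-in, not the cuGraph kernel from the backtrace):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the cuGraph kernel named in the backtrace.
__global__ void dummy_kernel(float* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = static_cast<float>(i); }
}

int main()
{
  cudaFuncAttributes attr{};
  cudaError_t status = cudaFuncGetAttributes(&attr, dummy_kernel);
  if (status != cudaSuccess) {
    std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(status));
    return 1;
  }
  // A -G (debug) build typically reports noticeably more registers per thread
  // than a release build, which shrinks the largest block size that can launch.
  std::printf("regs/thread = %d, static shared mem = %zu bytes, max threads/block = %d\n",
              attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);
  return 0;
}
```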

yigithanyigit added the "? - Needs Triage" and "question" labels on Dec 5, 2024

bdice commented Dec 5, 2024

Thanks for filing this! I think this might be a better fit for the cuGraph repository. The RMM failure you're observing is probably due to an earlier bug occurring in cuGraph's code. I will transfer this issue there.

cc: @ChuckHastings for awareness.
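
As background on why the failure surfaces inside RMM: once a kernel hits a sticky error, every later CUDA runtime call in the process reports a failure, including the cudaFree inside rmm::mr::cuda_memory_resource::do_deallocate, which is what trips the assertion. A small, hypothetical sketch of that behavior (the exact error codes differ from the ones in the report above):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately dereference a null device pointer to trigger a sticky error.
__global__ void bad_write(int* p) { *p = 42; }

int main()
{
  int* buf = nullptr;
  cudaMalloc(&buf, sizeof(int));

  bad_write<<<1, 1>>>(nullptr);                    // illegal memory access
  cudaError_t sync_err = cudaDeviceSynchronize();  // the failure surfaces here
  std::printf("sync: %s\n", cudaGetErrorString(sync_err));

  // The CUDA context is now corrupted, so even an unrelated deallocation
  // fails, just like the cudaFree inside RMM's do_deallocate in the report.
  cudaError_t free_err = cudaFree(buf);
  std::printf("cudaFree: %s\n", cudaGetErrorString(free_err));
  return 0;
}
```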

bdice transferred this issue from rapidsai/rmm on Dec 5, 2024

yigithanyigit commented Dec 10, 2024

Thank you for the response!

I think you are right! I compiled librmm from source and it works fine. I tried to trace the problem down further, but I couldn't achieve much in the process.

I am open to ideas for debugging/solving this.

Thanks

yigithanyigit commented

OK, I have not solved the problem, but I found a workaround.

The problem looks like undefined behavior in some part of the code. I only hit this problem if I build with debug symbols.

ChuckHastings commented

We have not been able to do a complete build of cugraph with debug symbols in a while (the overall code is too big). Can you share some simple code that reproduces your error? I can try to reproduce it myself, and that would make it easier to diagnose the problem you are seeing.

yigithanyigit commented

> We have not been able to do a complete build of cugraph with debug symbols in a while (the overall code is too big). Can you share some simple code that reproduces your error? I can try to reproduce it myself, and that would make it easier to diagnose the problem you are seeing.

First of all, thank you for your response!

The Louvain tests are failing (Rmat32, Rmat64). I assume you can run those tests.

I built with:

./build.sh libcugraph -g

From my observations, small datasets like karate work perfectly fine. Just FYI.

I hope it helps.

Thanks

ChuckHastings commented

I'm back from holiday break and will start investigating. I'll let you know how I progress.

ChuckHastings self-assigned this on Jan 6, 2025
ChuckHastings removed the "? - Needs Triage" label on Jan 6, 2025
ChuckHastings linked a pull request on Jan 17, 2025 that will close this issue
ChuckHastings commented

I have reproduced and isolated the issue. I've kicked this over to @seunghwak to investigate further. I believe the issue is that when we build in debug mode, some of the kernels use more resources, so we need to adjust the parallelism we ask for on the GPU. But Seunghwa is our expert on this, so I'll let him correct this and provide the final explanation.
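
This is not cuGraph's actual fix (that is what the linked PR will cover), but as a generic illustration of "adjusting the parallelism": a launch configuration can be derived from the kernel's measured resource footprint via the occupancy API, so a debug build that uses more registers automatically gets a smaller block size instead of failing with cudaErrorLaunchOutOfResources. heavy_kernel below is a hypothetical stand-in:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a register-heavy cuGraph kernel.
__global__ void heavy_kernel(float* out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { out[i] = out[i] * 2.0f + 1.0f; }
}

int main()
{
  int n       = 1 << 20;
  float* data = nullptr;
  cudaMalloc(&data, n * sizeof(float));

  // Let the runtime pick a block size that fits this kernel's register and
  // shared-memory usage; in a -G build the chosen block size shrinks rather
  // than the launch failing outright.
  int min_grid_size = 0;
  int block_size    = 0;
  cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, heavy_kernel, 0, 0);

  int grid_size = (n + block_size - 1) / block_size;
  heavy_kernel<<<grid_size, block_size>>>(data, n);
  cudaError_t err = cudaDeviceSynchronize();
  std::printf("block_size = %d, status = %s\n", block_size, cudaGetErrorString(err));

  cudaFree(data);
  return 0;
}
```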

We haven't been able to build our code in debug mode for a while... clearly some of the refactoring that we have done has made it possible again. Now that we can, I imagine we will do more compilation in this mode and discover any other hidden issues like this.
