KHARMA segfaults on Nvidia+Slingshot machines in some circumstances #46

vedantdhruv96 · 2023-11-28T20:17:11Z

The default GPU build on Delta, ./make.sh cuda hdf5, loads the nvhpc_latest module, so the C++ compiler is NVHPC 22.2.0. This results in the following error while building Kokkos:

"/u/vdhruv2/kharma-next-fixes-builds/external/parthenon/external/Kokkos/core/src/impl/Kokkos_CheckedIntegerOps.hpp", line 40: error: the type of the third operand of __builtin_mul_overflow must be an integral type
  return __builtin_mul_overflow(a, b, &res); 
                                ^
          detected during instantiation of "T Kokkos::Impl::multiply_overflow_abort(T, T) [with T=std::size_t]" at line 437 of "/u/vdhruv2/kharma-next-fixes-builds/external/parthenon/external/Kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp"

"/u/vdhruv2/kharma-next-fixes-builds/external/parthenon/external/Kokkos/core/src/impl/Kokkos_CheckedIntegerOps.hpp", line 40: internal error: transform_builtin_call: generalized builtin symbol not found.
  return __builtin_mul_overflow(a, b, &res);

The Kokkos version is 4.1.99 and the C++ standard is C++17.
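For context, the failing line uses the compiler builtin __builtin_mul_overflow to detect overflow when multiplying two std::size_t values. Below is a minimal sketch of the same check written without the builtin, only to illustrate what the call computes. The name checked_mul is hypothetical; this is not the Kokkos implementation and not a proposed patch.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper (not part of Kokkos): mirrors the return convention of
// __builtin_mul_overflow. Returns true if a * b would overflow std::size_t;
// otherwise stores the product in res and returns false.
bool checked_mul(std::size_t a, std::size_t b, std::size_t& res) {
  // For unsigned types, a * b overflows exactly when b > max / a (with a != 0).
  if (a != 0 && b > std::numeric_limits<std::size_t>::max() / a) {
    return true;
  }
  res = a * b;
  return false;
}
```

A caller would treat a true return value as the abort condition, in the same way the Kokkos routine named in the error message aborts on overflow.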


vedantdhruv96 · 2023-12-03T19:40:21Z

I'm able to build the code (after a wait), but it still fails at runtime with the error: [gpua030:751923:0:751923] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x23). With everything else held the same, an equivalent GCC build runs fine. Keeping this issue open, but changing it from "compile-time" to "runtime".

bprather · 2023-12-04T18:15:36Z

Renamed to reflect the runtime nature of the failure, and that this is probably also related to not being able to use device-side MPI buffers on Delta and similar machines. Chicoma's down today, but I'll take a look on both machines when I get the chance. It might be that the fix for MPI hangs clears this up or allows us to track it down better.

bprather · 2023-12-04T18:19:12Z

Also linking here that the long build has been opened as a bug in Parthenon: parthenon-hpc-lab/parthenon#922. As I mention there, I'll take a look sometime this week, as it's silly to be avoiding our best compiler stack.

vedantdhruv96 added the bug label Nov 28, 2023

bprather mentioned this issue Nov 28, 2023

Release KHARMA 2023.12 #42

Merged

vedantdhruv96 mentioned this issue Dec 2, 2023

NCSA Delta: Running on multiple GPUs on the 4xA100 nodes #44

Closed

bprather changed the title ~~NCSA Delta: nvhpc build~~ KHARMA segfaults on Nvidia+Slingshot machines in some circumstances Dec 4, 2023

This was referenced Jan 24, 2024

Delta compile with Slingshot 11 and new module stack #70

Merged

Revert Kokkos by default #72

Merged

bprather linked a pull request Jan 26, 2024 that will close this issue

Revert Kokkos by default #72

Merged

bprather closed this as completed in #72 Feb 8, 2024
