KHARMA segfaults on Nvidia+Slingshot machines in some circumstances #46

Closed
vedantdhruv96 opened this issue Nov 28, 2023 · 3 comments · Fixed by #72
Labels
bug Something isn't working

Comments

@vedantdhruv96
Contributor

The default GPU build on Delta, ./make.sh cuda hdf5, loads the nvhpc_latest module, so the C++ compiler is NVHPC 22.2.0. This results in the following error while building Kokkos:

"/u/vdhruv2/kharma-next-fixes-builds/external/parthenon/external/Kokkos/core/src/impl/Kokkos_CheckedIntegerOps.hpp", line 40: error: the type of the third operand of __builtin_mul_overflow must be an integral type
  return __builtin_mul_overflow(a, b, &res); 
                                ^
          detected during instantiation of "T Kokkos::Impl::multiply_overflow_abort(T, T) [with T=std::size_t]" at line 437 of "/u/vdhruv2/kharma-next-fixes-builds/external/parthenon/external/Kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp"

"/u/vdhruv2/kharma-next-fixes-builds/external/parthenon/external/Kokkos/core/src/impl/Kokkos_CheckedIntegerOps.hpp", line 40: internal error: transform_builtin_call: generalized builtin symbol not found.
  return __builtin_mul_overflow(a, b, &res);

The Kokkos version is 4.1.99 and the C++ standard is C++17.
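
For readers unfamiliar with the failing call: __builtin_mul_overflow(a, b, &res) multiplies a and b, stores the product in res, and returns true if the multiplication overflowed. The error above is the compiler rejecting that builtin when res is a std::size_t. As an illustration only, here is a minimal sketch of the same check written without any compiler builtin, assuming unsigned operands. The name portable_mul_overflow is hypothetical; it is not a Kokkos function and not necessarily how this issue was resolved.

```cpp
#include <cstddef>
#include <limits>
#include <type_traits>

// Hypothetical helper, NOT part of Kokkos: a builtin-free overflow check
// for unsigned integer multiplication.
// Returns true if a * b does not fit in T. Otherwise stores the product
// in *res and returns false.
// Difference from the builtin: on overflow, *res is left untouched here,
// whereas __builtin_mul_overflow stores the wrapped result.
template <typename T>
constexpr bool portable_mul_overflow(T a, T b, T* res) {
  static_assert(std::is_unsigned<T>::value,
                "this sketch only handles unsigned integer types");
  // a * b overflows exactly when b > max / a (for a != 0).
  if (a != 0 && b > std::numeric_limits<T>::max() / a) {
    return true;
  }
  *res = static_cast<T>(a * b);
  return false;
}
```

A caller would use it the same way as the builtin, for example `std::size_t n = 0; if (portable_mul_overflow<std::size_t>(rows, cols, &n)) { /* abort */ }`, where rows and cols are placeholder names. The division makes this slower than the builtin, so it only makes sense as a fallback for compilers that reject the builtin.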

@vedantdhruv96
Contributor Author

I'm now able to build the code (after a long wait), but it still fails at runtime with the error: [gpua030:751923:0:751923] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x23). With everything else held the same, an equivalent GCC build runs fine. Keeping this issue open, just changing it from "compile-time" to "runtime".

@bprather
Contributor

bprather commented Dec 4, 2023

Renamed to reflect the runtime nature of the problem, and that it is probably also related to not being able to use device-side MPI buffers on Delta and similar machines. Chicoma's down today, but I'll take a look on both machines when I get the chance. It might be that the fix for MPI hangs clears this up or allows us to track it down better.

@bprather bprather changed the title from "NCSA Delta: nvhpc build" to "KHARMA segfaults on Nvidia+Slingshot machines in some circumstances" Dec 4, 2023
@bprather
Contributor

bprather commented Dec 4, 2023

Also noting here that the long build has been opened as a bug in Parthenon: parthenon-hpc-lab/parthenon#922. As I mention there, I'll take a look sometime this week, as it's silly to be avoiding our best compiler stack.

@bprather bprather linked a pull request Jan 26, 2024 that will close this issue