periodic_test fails #81

Open
cwsmith opened this issue Jan 24, 2024 · 4 comments
Labels
bug Something isn't working

Comments

cwsmith commented Jan 24, 2024

periodic_test fails with a segmentation fault when built from master with the Kokkos Serial backend. Below is the valgrind output from one of the two processes; the other process produced a similar trace.

Omega_h cmake args:

$ cat Omega_h_cmake_args.txt
-DBUILD_TESTING:BOOL="on" -DBUILD_SHARED_LIBS:BOOL="on" -DCMAKE_INSTALL_PREFIX:PATH="/space/cwsmith/omegahKkVersions/buildOmegahSimKokkosSerialMpion_master/install" -DOmega_h_USE_Kokkos:BOOL="on" -DKokkos_PREFIX:PATH="/space/cwsmith/omegahKkVersions/buildKokkos/install" -DOmega_h_USE_SimModSuite:BOOL="on" -DOmega_h_USE_MPI:BOOL="on" -DOmega_h_USE_MPI:BOOL="on" -DOmega_h_USE_Kokkos:BOOL="on" -DKokkos_PREFIX:PATH="/space/cwsmith/omegahKkVersions/buildKokkos/install" -DOmega_h_USE_MPI:BOOL="on" -DOmega_h_USE_OpenMP:BOOL="OFF" -DOmega_h_USE_CUDA:BOOL="OFF"

Versions

omegah - master @ c5f1dc9d
kokkos - develop @ ed08974c7 (newer than last tagged version of 4.2.00)
simmetrix simmodsuite - 2023.1-230907dev

Valgrind output:

==3612296== Memcheck, a memory error detector
==3612296== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3612296== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3612296== Command: ./src/periodic_test /space/cwsmith/omegahKkVersions/omega_h_master/meshes/wedge_matchZ_12elem.sms /space/cwsmith/omegahKkVersions/omega_h_master/meshes/wedge_match.smd /space/cwsmith/omegahKkVersions/omega_h_master/meshes/wedge_matchZ_12elem_sync_2.osh 2
==3612296== Parent PID: 3612294
==3612296==
==3612296== Invalid read of size 4
==3612296==    at 0x6654270: host_atomic_fetch_oper<desul::Impl::sub_operator<int, int const>, int, desul::MemoryOrderRelaxed> (Fetch_Op_ScopeCaller.hpp:44)
==3612296==    by 0x6654270: host_atomic_fetch_sub<int, desul::MemoryOrderRelaxed, desul::MemoryScopeCaller> (Fetch_Op_Generic.hpp:40)
==3612296==    by 0x6654270: atomic_fetch_sub<int, desul::MemoryOrderRelaxed, desul::MemoryScopeCaller> (Generic.hpp:60)
==3612296==    by 0x6654270: atomic_fetch_sub<int> (Kokkos_Atomics_Desul_Wrapper.hpp:83)
==3612296==    by 0x6654270: Kokkos::Impl::SharedAllocationRecord<void, void>::decrement(Kokkos::Impl::SharedAllocationRecord<void, void>*) (Kokkos_SharedAlloc.cpp:212)
==3612296==    by 0x5213382: assign_direct (Kokkos_SharedAlloc.hpp:477)
==3612296==    by 0x5213382: Kokkos::Impl::ViewTracker<Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > >::operator=(Kokkos::Impl::ViewTracker<Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > > const&) (Kokkos_ViewTracker.hpp:79)
==3612296==    by 0x521076E: Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::operator=(Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > const&) (Kokkos_View.hpp:1288)
==3612296==    by 0x520BA08: Omega_h::Write<int>::operator=(Omega_h::Write<int> const&) (Omega_h_array.hpp:49)
==3612296==    by 0x5221F08: Omega_h::Read<int>::operator=(Omega_h::Read<int> const&) (Omega_h_array.hpp:88)
==3612296==    by 0x5451023: Omega_h::Mesh::copy_meta() const (Omega_h_mesh.cpp:1235)
==3612296==    by 0x54BE3C9: Omega_h::migrate_mesh(Omega_h::Mesh*, Omega_h::Dist, Omega_h_Parting, bool) (Omega_h_migrate.cpp:383)
==3612296==    by 0x544D863: Omega_h::Mesh::balance(bool) (Omega_h_mesh.cpp:956)
==3612296==    by 0x41CFCF: main (periodic_test.cpp:61)
==3612296==  Address 0x38 is not stack'd, malloc'd or (recently) free'd
==3612296==
==3612296==
==3612296== Process terminating with default action of signal 11 (SIGSEGV)
==3612296==  Access not within mapped region at address 0x38
==3612296==    at 0x6654270: host_atomic_fetch_oper<desul::Impl::sub_operator<int, int const>, int, desul::MemoryOrderRelaxed> (Fetch_Op_ScopeCaller.hpp:44)
==3612296==    by 0x6654270: host_atomic_fetch_sub<int, desul::MemoryOrderRelaxed, desul::MemoryScopeCaller> (Fetch_Op_Generic.hpp:40)
==3612296==    by 0x6654270: atomic_fetch_sub<int, desul::MemoryOrderRelaxed, desul::MemoryScopeCaller> (Generic.hpp:60)
==3612296==    by 0x6654270: atomic_fetch_sub<int> (Kokkos_Atomics_Desul_Wrapper.hpp:83)
==3612296==    by 0x6654270: Kokkos::Impl::SharedAllocationRecord<void, void>::decrement(Kokkos::Impl::SharedAllocationRecord<void, void>*) (Kokkos_SharedAlloc.cpp:212)
==3612296==    by 0x5213382: assign_direct (Kokkos_SharedAlloc.hpp:477)
==3612296==    by 0x5213382: Kokkos::Impl::ViewTracker<Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > >::operator=(Kokkos::Impl::ViewTracker<Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > > const&) (Kokkos_ViewTracker.hpp:79)
==3612296==    by 0x521076E: Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::operator=(Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > const&) (Kokkos_View.hpp:1288)
==3612296==    by 0x520BA08: Omega_h::Write<int>::operator=(Omega_h::Write<int> const&) (Omega_h_array.hpp:49)
==3612296==    by 0x5221F08: Omega_h::Read<int>::operator=(Omega_h::Read<int> const&) (Omega_h_array.hpp:88)
==3612296==    by 0x5451023: Omega_h::Mesh::copy_meta() const (Omega_h_mesh.cpp:1235)
==3612296==    by 0x54BE3C9: Omega_h::migrate_mesh(Omega_h::Mesh*, Omega_h::Dist, Omega_h_Parting, bool) (Omega_h_migrate.cpp:383)
==3612296==    by 0x544D863: Omega_h::Mesh::balance(bool) (Omega_h_mesh.cpp:956)
==3612296==    by 0x41CFCF: main (periodic_test.cpp:61)
==3612296==  If you believe this happened as a result of a stack
==3612296==  overflow in your program's main thread (unlikely but
==3612296==  possible), you can try to increase the size of the
==3612296==  main thread stack using the --main-stacksize= flag.
==3612296==  The main thread stack size used in this run was 8388608.
==3612296==
==3612296== HEAP SUMMARY:
==3612296==     in use at exit: 13,116,178 bytes in 4,205 blocks
==3612296==   total heap usage: 15,374 allocs, 11,169 frees, 14,496,121 bytes allocated
==3612296==
==3612296== LEAK SUMMARY:
==3612296==    definitely lost: 0 bytes in 0 blocks
==3612296==    indirectly lost: 0 bytes in 0 blocks
==3612296==      possibly lost: 10,525 bytes in 206 blocks
==3612296==    still reachable: 13,105,653 bytes in 3,999 blocks
==3612296==         suppressed: 0 bytes in 0 blocks
==3612296== Rerun with --leak-check=full to see details of leaked memory
==3612296==
==3612296== For lists of detected and suppressed errors, rerun with: -s
==3612296== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
cwsmith added the bug label Jan 24, 2024
cwsmith added a commit that referenced this issue Jan 24, 2024
joshia5 (Collaborator) commented Feb 1, 2024

At the time of development, the test passed with the CUDA backend and showed no errors when run under valgrind.

joshia5 (Collaborator) commented Feb 1, 2024

A starting point for debugging would be to step into the "migrate_matches" routine. I am not sure when I'll be able to reproduce this issue and work on a fix.

joshia5 (Collaborator) commented Feb 1, 2024

@cwsmith Is it possible this is a new Kokkos/GPU-backend issue?

cwsmith (Author) commented Feb 1, 2024

Good question. I can check that.
