CUDA-aware run of delta wing case fails on Perlmutter #89

Open
cwsmith opened this issue Mar 6, 2024 · 0 comments

environment

$ module li

Currently Loaded Modules:
  1) craype-x86-milan     3) craype-network-ofi                      5) PrgEnv-gnu/8.5.0   7) cray-libsci/23.12.5   9) craype/2.7.30    11) perftools-base/23.12.0  13) cudatoolkit/12.2       15) gpu/1.0
  2) libfabric/1.15.2.0   4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta   6) cray-dsmml/0.2.2   8) cray-mpich/8.1.28    10) gcc-native/12.3  12) cpe/23.12               14) craype-accel-nvidia80

versions

  • Omega_h: scorec/omega_h master @ 7a39707
  • Kokkos: kokkos/kokkos master @ e0dc0128e

build

$ cat doConfigPerlKk.sh 
bdir=$PWD/build-kokkos
cmake -S kokkos -B $bdir \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=ON \
  -DCRAYPE_LINK_TYPE=dynamic \
  -DCMAKE_CXX_COMPILER=$PWD/kokkos/bin/nvcc_wrapper \
  -DKokkos_ARCH_AMPERE80=ON \
  -DKokkos_ENABLE_SERIAL=ON \
  -DKokkos_ENABLE_OPENMP=off \
  -DKokkos_ENABLE_CUDA=on \
  -DKokkos_ENABLE_CUDA_LAMBDA=on \
  -DKokkos_ENABLE_DEBUG=off \
  -DCMAKE_INSTALL_PREFIX=$bdir/install
$ cat doConfigPerlOmegah.sh 
#!/bin/bash -ex

usage="Usage: $0  <mpi=on|off> <cudaAware=on|off>"
[[ $# -ne 2 ]] && echo $usage && exit 1

mpi=$1
[[ $mpi != "on" && $mpi != "off" ]] && echo $usage && exit 1

cudaAware=$2
[[ $cudaAware != "on" && $cudaAware != "off" ]] && echo $usage && exit 1

bdir=$PWD/build-omegah-mpi${mpi}-cudaAware${cudaAware}
cmake -S omega_h -B $bdir \
  -DCMAKE_INSTALL_PREFIX=$bdir/install \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=on \
  -DOmega_h_USE_Kokkos=on \
  -DOmega_h_CUDA_ARCH=80 \
  -DOmega_h_USE_MPI=$mpi \
  -DOmega_h_USE_CUDA_AWARE_MPI=$cudaAware \
  -DBUILD_TESTING=on \
  -DCMAKE_CXX_COMPILER=CC
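
The configure script above takes two `on|off` arguments and derives the build directory name from them. As a minimal sketch of that naming rule (the `build_dir_name` helper is hypothetical and not part of the original scripts), the two builds referenced in the run script would come from invoking the script as `./doConfigPerlOmegah.sh on off` and `./doConfigPerlOmegah.sh on on`:

```shell
#!/bin/bash
# Hypothetical helper: mirrors the bdir naming used in doConfigPerlOmegah.sh.
# It only builds the directory name; it does not run cmake.
build_dir_name() {
  local mpi=$1
  local cudaAware=$2
  echo "build-omegah-mpi${mpi}-cudaAware${cudaAware}"
}

# The two configurations used by the run script in this report.
build_dir_name on off
build_dir_name on on
```

The resulting names match the `build-omegah-mpion-cudaAwareoff` and `build-omegah-mpion-cudaAwareon` directories that appear in the run script paths.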

run

Download the Omega_h delta wing meshes: https://zenodo.org/records/10672130

$ cat submitP2.sh
sbatch --nodes 1 --qos regular --time 00:10:00 --constraint gpu --gpus 4 --account=PROJECT_NAME ./runP2.sh
$ cat runP2.sh
#!/bin/bash
bin_cudaAwareOff=/pscratch/sd/c/cwsmith/omegahDeltaWingAdapt/twoGpus/build-omegah-mpion-cudaAwareoff/src
bin_cudaAwareOn=/pscratch/sd/c/cwsmith/omegahDeltaWingAdapt/twoGpus/build-omegah-mpion-cudaAwareon/src
mesh=/pscratch/sd/c/cwsmith/omegahDeltaWingAdapt/twoGpus/deltaWing_500kMetric_p2.osh

cmd="$bin_cudaAwareOff/ugawg_hsc_oshmeshload --osh-pool $mesh"
export MPICH_GPU_SUPPORT_ENABLED=0
set -x
srun -n 2 $cmd &> log2p_cudaAwareOff
set +x

cmd="$bin_cudaAwareOn/ugawg_hsc_oshmeshload --osh-pool $mesh"
export MPICH_GPU_SUPPORT_ENABLED=1
set -x
srun -n 2 $cmd &> log2p_cudaAwareOn
set +x

error

$ cat log2p_cudaAwareOn
(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 148
MPICH ERROR [Rank 0] [job id 22622708.1] [Wed Mar  6 07:48:56 2024] [nid002241] - Abort(606713346) (rank 0 in comm 0): Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(161)......................: MPI_Isend(buf=0x623196f88, count=2382, MPI_INT, dest=1, tag=42, comm=0xc4000000, request=0x23c3f34) failed
MPID_Isend(584)......................: 
MPIDI_isend_unsafe(136)..............: 
MPIDI_SHM_mpi_isend(323).............: 
MPIDI_CRAY_Common_lmt_isend(84)......: 
MPIDI_CRAY_Common_lmt_export_mem(103): 
(unknown)(): Invalid count

aborting job:
Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(161)......................: MPI_Isend(buf=0x623196f88, count=2382, MPI_INT, dest=1, tag=42, comm=0xc4000000, request=0x23c3f34) failed
MPID_Isend(584)......................: 
MPIDI_isend_unsafe(136)..............: 
MPIDI_SHM_mpi_isend(323).............: 
MPIDI_CRAY_Common_lmt_isend(84)......: 
MPIDI_CRAY_Common_lmt_export_mem(103): 
(unknown)(): Invalid count
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
srun: error: nid002241: task 0: Exited with exit code 255
srun: Terminating StepId=22622708.1
slurmstepd: error: *** STEP 22622708.1 ON nid002241 CANCELLED AT 2024-03-06T15:48:58 ***
srun: error: nid002241: task 1: Terminated
srun: Force Terminated StepId=22622708.1
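
Because the MPICH error stack is printed twice in the log, it can help to pull out only the first failing call when comparing runs. The following is a rough sketch, assuming GNU `grep` and the log file name used above; `first_mpich_error` is a hypothetical helper, not part of Omega_h or Cray MPICH:

```shell
#!/bin/bash
# Hypothetical helper: print the first "Fatal error in <call>: ..." message
# found in an MPICH log. -m1 stops after the first matching line and -o
# prints only the matched portion of that line.
first_mpich_error() {
  grep -m1 -o 'Fatal error in [A-Za-z_]*: .*' "$1"
}

# Only run against the log from this report if it is present.
if [ -f log2p_cudaAwareOn ]; then
  first_mpich_error log2p_cudaAwareOn
fi
```

Running the same helper on `log2p_cudaAwareOff` gives a quick check of whether the non-CUDA-aware build hits the same call.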