Improve throughput script + Results of further AOSOA tests #209

valassi · 2021-06-11T10:02:43Z

A few additional tests while preparing the vCHEP paper.

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.267203e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.364154e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.722013 sec
2,528,979,124 cycles # 2.652 GHz
3,490,457,956 instructions # 1.38 insn per cycle
1.013854254 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.309556e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 6.971825 sec
20,204,537,861 cycles # 2.674 GHz
48,560,277,826 instructions # 2.40 insn per cycle
6.979984273 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.534958e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.659099 sec
14,025,995,113 cycles # 2.672 GHz
30,075,975,631 instructions # 2.14 insn per cycle
4.667759564 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.611130e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.469995 sec
10,299,552,854 cycles # 2.536 GHz
16,693,116,256 instructions # 1.62 insn per cycle
3.478566716 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.925171e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.400321 sec
10,168,815,414 cycles # 2.542 GHz
16,278,373,573 instructions # 1.60 insn per cycle
3.408559417 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.726243e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.828006 sec
9,961,706,306 cycles # 2.250 GHz
13,142,521,259 instructions # 1.32 insn per cycle
3.836258066 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================

On pmpe04.cern.ch [CPU: Intel(R) Xeon(R) CPU E5-2630 v3] [GPU: none]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.305101e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 6.830661 sec
21,856,861,877 cycles:u # 2.811 GHz
48,486,844,412 instructions:u # 2.22 insn per cycle
6.838002353 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.234273e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.877238 sec
15,542,989,077 cycles:u # 2.668 GHz
30,005,506,418 instructions:u # 1.93 insn per cycle
4.884807714 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.039891e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.603382 sec
11,649,111,915 cycles:u # 2.551 GHz
16,629,656,104 instructions:u # 1.43 insn per cycle
3.610842924 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
ERROR! The application is built for skylake-avx512 (AVX512VL) but the host does not support it
2,155,744 cycles:u # 0.697 GHz
2,347,155 instructions:u # 1.09 insn per cycle
0.005606592 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
ERROR! The application is built for skylake-avx512 (AVX512VL) but the host does not support it
1,984,218 cycles:u # 0.367 GHz
2,347,081 instructions:u # 1.18 insn per cycle
0.007985069 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================
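To read the SIMD numbers above at a glance, the throughputs can be turned into speedups over the scalar build. The following is only a sketch: the dictionary copies the `EvtsPerSec[MECalcOnly]` values from the itscrd70 log, and `speedups` is a hypothetical helper name, not something provided by `throughput12.sh`.

```python
# Throughputs (events per second, double precision) copied from the
# itscrd70 log above, keyed by the build tag printed in the log.
throughput_itscrd70 = {
    "none": 1.309556e06,
    "sse4": 2.534958e06,
    "avx2": 4.611130e06,
    "512y": 4.925171e06,
    "512z": 3.726243e06,
}


def speedups(throughput, baseline="none"):
    """Return each build's throughput divided by the baseline build's."""
    base = throughput[baseline]
    return {build: rate / base for build, rate in throughput.items()}


if __name__ == "__main__":
    for build, factor in speedups(throughput_itscrd70).items():
        print(f"{build:>4}: {factor:.2f}x")
```

On these numbers the speedup over the scalar build is roughly 1.9x for sse4, 3.5x for avx2, 3.8x for 512y and 2.8x for 512z, so on this Silver 4216 the 512y build is the fastest and the 512z build is slower than avx2.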

…the best)

On bmk32-cc7-rhl4dhp74r.cern.ch [CPU: Intel(R) Xeon(R) Gold 6130 CPU] [GPU: none]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.319663e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.106917 sec
real 0m7.136s
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.534639e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.812357 sec
real 0m4.841s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.685652e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.651054 sec
real 0m3.679s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.822843e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.655420 sec
real 0m3.685s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.634158e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.676052 sec
real 0m3.706s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================

NB: code compiled on itscrd70 and copied over as-is

srcdir=/afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpuBis/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
cd /data/avalassi/GPU2020/gold
mkdir -p epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
\cp $srcdir/throughput12.sh .
\cp $srcdir/simdSymSummary.sh .
for f in $srcdir/build*/check.exe $srcdir/build*/CPPProcess.o; do echo $f; d=$(basename $(dirname $f)); mkdir -p $d; \cp $f $d; done
mkdir -p ../../Cards/
\cp $srcdir/../../Cards/param_card.dat ../../Cards/param_card.dat
\cp /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/lib64/libstdc++.so.6 .
\cp /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/lib64/libgomp.so.1 .
\cp /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/lib64/libgcc_s.so.1 .
cd ../../../../../..; tar -cvf gold.tar gold; cd -
scp ../../../../../../gold.tar avalassi@bmk32-cc7-rhl4dhp74r:/home/avalassi
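Since the binaries are built on one host and copied as-is to another, it can be useful to check in advance which builds the target CPU is able to run (the AVX512 builds abort on a host without AVX512VL, as in the pmpe04 log). Below is a minimal sketch, assuming a Linux host that exposes `/proc/cpuinfo`. The mapping from build tag to required CPU flag is an assumption based only on the build descriptions and the error message in the logs; the real builds may require further flags.

```python
from pathlib import Path

# Assumed mapping from build tag to the /proc/cpuinfo flags it needs.
# Illustrative only: the AVX512 entries follow the "(AVX512VL)" hint in
# the error message, and real builds may depend on additional flags.
REQUIRED_FLAGS = {
    "none": set(),
    "sse4": {"sse4_2"},
    "avx2": {"avx2"},
    "512y": {"avx512vl"},
    "512z": {"avx512vl"},
}


def cpu_flags(cpuinfo_text):
    """Return the flags listed on the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def runnable_builds(flags):
    """Return the build tags whose assumed flag requirements are all met."""
    return [tag for tag, need in REQUIRED_FLAGS.items() if need <= flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    print("runnable builds:", runnable_builds(cpu_flags(text)))
```

On a host like pmpe04 (AVX2 but no AVX512) this sketch would list only the none, sse4 and avx2 builds, which matches the builds that ran in that log.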

… perf

On bmk32-cc7-rhl4dhp74r.cern.ch [CPU: Intel(R) Xeon(R) Gold 6130 CPU] [GPU: none]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.316104e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.148966 sec
21,927,198,373 cycles # 2.754 GHz
49,099,019,022 instructions # 2.24 insn per cycle
7.177090888 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.526108e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.851482 sec
15,621,448,969 cycles # 2.757 GHz
30,607,518,127 instructions # 1.96 insn per cycle
4.879296659 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.675595e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.661612 sec
11,770,617,273 cycles # 2.637 GHz
17,226,922,814 instructions # 1.46 insn per cycle
3.689494234 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.850632e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.637422 sec
11,844,213,793 cycles # 2.641 GHz
16,812,019,185 instructions # 1.42 insn per cycle
3.665992532 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.655938e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.684760 sec
11,173,811,979 cycles # 2.485 GHz
13,674,017,211 instructions # 1.22 insn per cycle
3.712847551 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.308379e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.000368 sec
7,563.42 msec task-clock # 1.079 CPUs utilized
92 context-switches # 0.012 K/sec
34 cpu-migrations # 0.004 K/sec
6,623 page-faults # 0.876 K/sec
20,206,016,940 cycles # 2.672 GHz
48,559,476,997 instructions # 2.40 insn per cycle
1,617,798,346 branches # 213.898 M/sec
40,829,916 branch-misses # 2.52% of all branches
12,737,716,160 L1-dcache-loads # 1684.122 M/sec
129,760,167 L1-dcache-load-misses # 1.02% of all L1-dcache hits
7.009693745 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.533888e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.697803 sec
5,265.27 msec task-clock # 1.119 CPUs utilized
86 context-switches # 0.016 K/sec
27 cpu-migrations # 0.005 K/sec
6,623 page-faults # 0.001 M/sec
14,055,803,585 cycles # 2.670 GHz
30,070,489,639 instructions # 2.14 insn per cycle
1,358,524,691 branches # 258.016 M/sec
40,848,534 branch-misses # 3.01% of all branches
7,557,353,591 L1-dcache-loads # 1435.321 M/sec
122,717,179 L1-dcache-load-misses # 1.62% of all L1-dcache hits
4.706821960 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.603137e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.475700 sec
4,067.94 msec task-clock # 1.167 CPUs utilized
79 context-switches # 0.019 K/sec
32 cpu-migrations # 0.008 K/sec
7,133 page-faults # 0.002 M/sec
10,306,891,498 cycles # 2.534 GHz
16,693,240,427 instructions # 1.62 insn per cycle
1,237,173,420 branches # 304.127 M/sec
40,830,545 branch-misses # 3.30% of all branches
5,176,991,254 L1-dcache-loads # 1272.631 M/sec
101,854,732 L1-dcache-load-misses # 1.97% of all L1-dcache hits
3.485010692 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.911909e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.415136 sec
4,010.12 msec task-clock # 1.171 CPUs utilized
126 context-switches # 0.031 K/sec
33 cpu-migrations # 0.008 K/sec
6,624 page-faults # 0.002 M/sec
10,189,296,065 cycles # 2.541 GHz
16,274,934,090 instructions # 1.60 insn per cycle
1,135,981,115 branches # 283.279 M/sec
40,929,587 branch-misses # 3.60% of all branches
3,722,826,658 L1-dcache-loads # 928.358 M/sec
101,817,067 L1-dcache-load-misses # 2.73% of all L1-dcache hits
3.424371340 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.727578e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.846731 sec
4,425.39 msec task-clock # 1.148 CPUs utilized
87 context-switches # 0.020 K/sec
35 cpu-migrations # 0.008 K/sec
8,157 page-faults # 0.002 M/sec
9,967,056,024 cycles # 2.252 GHz
13,145,760,375 instructions # 1.32 insn per cycle
1,080,089,749 branches # 244.066 M/sec
41,016,506 branch-misses # 3.80% of all branches
2,849,996,485 L1-dcache-loads # 644.010 M/sec
95,185,042 L1-dcache-load-misses # 3.34% of all L1-dcache hits
3.855786619 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================
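The derived columns in the `perf stat` output above (insn per cycle, L1 miss percentage) follow directly from the raw counters. The sketch below recomputes them for the scalar and 512z builds on itscrd70; the helper names are hypothetical, and the counter strings are copied from the log.

```python
def to_int(perf_count):
    """Convert a perf counter string such as '48,559,476,997' to an int."""
    return int(perf_count.replace(",", ""))


def insn_per_cycle(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return to_int(instructions) / to_int(cycles)


def l1_miss_pct(load_misses, loads):
    """L1 data cache load misses as a percentage of L1 data cache loads."""
    return 100.0 * to_int(load_misses) / to_int(loads)


# Raw counters copied from the itscrd70 perf output above.
counters = {
    "none": {"cycles": "20,206,016,940", "instructions": "48,559,476,997",
             "loads": "12,737,716,160", "misses": "129,760,167"},
    "512z": {"cycles": "9,967,056,024", "instructions": "13,145,760,375",
             "loads": "2,849,996,485", "misses": "95,185,042"},
}

if __name__ == "__main__":
    for build, c in counters.items():
        ipc = insn_per_cycle(c["instructions"], c["cycles"])
        miss = l1_miss_pct(c["misses"], c["loads"])
        print(f"{build:>4}: {ipc:.2f} insn per cycle, {miss:.2f}% L1 load misses")
```

The recomputed values agree with the figures printed by perf (2.40 and 1.32 insn per cycle, 1.02% and 3.34% L1 load misses). They also show that the 512z build retires far fewer instructions than the scalar build, but at a lower rate per cycle and a lower clock frequency.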

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.237518e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.361609e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.725189 sec
2,549,736,022 cycles # 2.650 GHz
3,503,747,691 instructions # 1.37 insn per cycle
1.023542621 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,444 [-p 2048 256 1]
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[1] == AOS [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.219148e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351486e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371810e-02 +- 3.270123e-06 ) GeV^0
TOTAL : 1.185026 sec
3,343,282,112 cycles # 2.645 GHz
4,786,320,070 instructions # 1.43 insn per cycle
1.484666756 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 40 [-p 1 4 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,244,176 [-p 2048 256 1]
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[1] == AOS [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[1] == AOS
EvtsPerSec[MatrixElems] (3) = ( 6.972024e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.266853e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371810e-02 +- 3.270123e-06 ) GeV^0
TOTAL : 1.128006 sec
3,365,367,187 cycles # 2.651 GHz
4,800,781,439 instructions # 1.43 insn per cycle
1.425973157 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 40 [-p 1 1 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 160 [-p 1 4 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 1,280 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 20,971,007 [-p 2048 256 1]
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[1] == AOS
EvtsPerSec[MatrixElems] (3) = ( 6.997163e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.272792e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.208810 sec
3,334,791,948 cycles # 2.652 GHz
4,801,124,639 instructions # 1.44 insn per cycle
1.502558668 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 1,280 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 20,971,817 [-p 2048 256 1]
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[8]
EvtsPerSec[MatrixElems] (3) = ( 7.230897e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.349811e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.161124 sec
3,327,628,044 cycles # 2.650 GHz
4,775,011,660 instructions # 1.43 insn per cycle
1.455159065 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,514 [-p 2048 256 1]
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.227665e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.352993e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.119368 sec
3,329,955,135 cycles # 2.649 GHz
4,782,880,833 instructions # 1.44 insn per cycle
1.414098430 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,035 [-p 2048 256 1]
=========================================================================
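The sector counts reported for `sigmaKin` in the profiles above follow the momenta memory layout, and a simple coalescing model reproduces them for the single-block launches. This is a sketch under stated assumptions, not the profiler's definition: it assumes 32-byte memory sectors, 8-byte doubles, threads that process consecutive events, sector-aligned AOSOA pages, and that the second number after `-p` is the number of threads per block. The function name and the `neppm` parameter (events per AOSOA page for momenta) are hypothetical.

```python
import math

SECTOR_BYTES = 32  # assumption: global memory is fetched in 32-byte sectors
FP_BYTES = 8       # double precision, as in the logs above


def sectors_per_request(active_threads, neppm):
    """Sectors touched when each active thread loads one momentum component.

    Model: threads handle consecutive events and events are stored in pages
    of `neppm` events (AOSOA[neppm]; neppm=1 is AOS), so for one component
    each page holds `neppm` contiguous values, assumed sector-aligned.
    """
    pages = math.ceil(active_threads / neppm)
    values_per_page = min(active_threads, neppm)
    return pages * math.ceil(values_per_page * FP_BYTES / SECTOR_BYTES)


# (threads per block, neppm) -> sectors reported above for 40 requests
observed = {
    (1, 1): 40, (4, 1): 160, (8, 1): 320, (32, 1): 1280,  # momenta as AOS
    (4, 4): 40, (8, 4): 80, (32, 4): 320,                 # momenta as AOSOA[4]
    (8, 8): 80, (32, 8): 320,                             # momenta as AOSOA[8]
}

if __name__ == "__main__":
    for (threads, neppm), sectors in observed.items():
        model = 40 * sectors_per_request(threads, neppm)
        print(f"threads={threads:2d} AOSOA[{neppm}]: model {model:4d}, profiler {sectors:4d}")
```

Under this model a full warp needs 32 sectors per request with AOS momenta but only 8 with AOSOA[4] or AOSOA[8], which is the factor of four seen between the 1,280 and 320 sector counts for `-p 1 32 1`.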

This is because the kernel needs the four-momenta of all four particles, times four helicities (64 requests), instead of two four-momenta and two pz momenta, times four helicities (40 requests).

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 5.129781e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 7.643036e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.223921 sec
3,081,196,662 cycles # 2.651 GHz
4,394,808,734 instructions # 1.43 insn per cycle
1.521394982 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 136
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 64 sectors 128 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 64 sectors 512 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 1,048,576 sectors 8,388,608 [-p 2048 256 1]
=========================================================================

This is the result of a TEST with a single helicity:

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 1.081427e+09 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.581359e+09 ) sec^-1
MeanMatrixElemValue = ( 4.963658e-03 +- 1.770609e-06 ) GeV^0
TOTAL : 0.930117 sec
2,957,532,161 cycles # 2.651 GHz
4,187,693,693 instructions # 1.42 insn per cycle
1.227039609 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 98
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 10 sectors 20 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 10 sectors 80 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 163,840 sectors 1,310,720 [-p 2048 256 1]
=========================================================================
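The request counts in the profiles above (64, 40 and 10 for a single-block launch) can be recovered by counting the momentum components loaded per event. The sketch below does this arithmetic and scales it to the full grid; it assumes one request per warp for each component load, a warp size of 32, and that `-p 2048 256 1` means 2048 blocks of 256 threads and one iteration. The function names are hypothetical.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def loads_per_event(n_four_momenta, n_pz_only, n_helicities):
    """Momentum components read per event: 4 per full four-momentum,
    1 per particle for which only pz is read, repeated for each helicity."""
    return (4 * n_four_momenta + n_pz_only) * n_helicities


def total_requests(blocks, threads_per_block, loads):
    """Requests for a launch, assuming one request per warp per load."""
    warps = blocks * math.ceil(threads_per_block / WARP_SIZE)
    return warps * loads


if __name__ == "__main__":
    cases = {
        "all four four-momenta, 4 helicities": loads_per_event(4, 0, 4),
        "two four-momenta + two pz, 4 helicities": loads_per_event(2, 2, 4),
        "two four-momenta + two pz, 1 helicity": loads_per_event(2, 2, 1),
    }
    for label, loads in cases.items():
        print(f"{label}: {loads} per warp, "
              f"{total_requests(2048, 256, loads):,} for 2048 x 256 threads")
```

With these assumptions the three cases give 64, 40 and 10 requests per warp, and 1,048,576, 655,360 and 163,840 requests for the full grid, matching the profiler output in the logs.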

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.043865e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351422e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.723324 sec
2,546,454,804 cycles # 2.656 GHz
3,488,166,591 instructions # 1.37 insn per cycle
1.020129392 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,242,606 [-p 2048 256 1]
=========================================================================

valassi · 2021-06-11T10:03:53Z

Will self-merge.

valassi added 18 commits May 31, 2021 10:45

[tput] dump CPU name in throughput12.sh

8b32110

[tput] fix throughput12.sh to address no-gpu cases

1123c7f

Merge remote-tracking branch 'upstream/master' into tput

45db29b

[tput] add -req option to throughput12.sh to show sectors/requests

fb6a64f

[tput] throughput12.sh better printouts of sectors/requests (use 1 32 1)

034c8b1

valassi merged commit 8a7c494 into madgraph5:master Jun 11, 2021

valassi mentioned this pull request Jun 11, 2021

AOS/SOA for input particle 4-momenta (and random numbers) #16

Closed

valassi mentioned this pull request Jul 25, 2021

More complete analysis of AVX512 in both gcc and clang #173

Open
