Improve throughput script + Results of further AOSOA tests #209
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.267203e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.364154e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.722013 sec
2,528,979,124 cycles # 2.652 GHz
3,490,457,956 instructions # 1.38 insn per cycle
1.013854254 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.309556e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 6.971825 sec
20,204,537,861 cycles # 2.674 GHz
48,560,277,826 instructions # 2.40 insn per cycle
6.979984273 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.534958e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.659099 sec
14,025,995,113 cycles # 2.672 GHz
30,075,975,631 instructions # 2.14 insn per cycle
4.667759564 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.611130e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.469995 sec
10,299,552,854 cycles # 2.536 GHz
16,693,116,256 instructions # 1.62 insn per cycle
3.478566716 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.925171e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.400321 sec
10,168,815,414 cycles # 2.542 GHz
16,278,373,573 instructions # 1.60 insn per cycle
3.408559417 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.726243e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.828006 sec
9,961,706,306 cycles # 2.250 GHz
13,142,521,259 instructions # 1.32 insn per cycle
3.836258066 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================

On pmpe04.cern.ch [CPU: Intel(R) Xeon(R) CPU E5-2630 v3] [GPU: none]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.305101e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 6.830661 sec
21,856,861,877 cycles:u # 2.811 GHz
48,486,844,412 instructions:u # 2.22 insn per cycle
6.838002353 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.234273e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.877238 sec
15,542,989,077 cycles:u # 2.668 GHz
30,005,506,418 instructions:u # 1.93 insn per cycle
4.884807714 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.039891e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.603382 sec
11,649,111,915 cycles:u # 2.551 GHz
16,629,656,104 instructions:u # 1.43 insn per cycle
3.610842924 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
ERROR! The application is built for skylake-avx512 (AVX512VL) but the host does not support it
2,155,744 cycles:u # 0.697 GHz
2,347,155 instructions:u # 1.09 insn per cycle
0.005606592 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
ERROR! The application is built for skylake-avx512 (AVX512VL) but the host does not support it
1,984,218 cycles:u # 0.367 GHz
2,347,081 instructions:u # 1.18 insn per cycle
0.007985069 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================
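The two ERROR blocks above come from running the '512y' and '512z' builds on a host without AVX512VL. A minimal sketch of a pre-flight check is given below; it is not the check that produced the error message. The helper names `read_cpu_flags` and `runnable_builds` are hypothetical, and using the Linux `/proc/cpuinfo` flag names `sse4_2`, `avx2` and `avx512vl` as the selection criteria is an assumption.

```python
def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the CPU feature flags listed by the Linux kernel (empty set if none found)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()


def runnable_builds(cpu_flags):
    """Return the SIMD build tags used in these logs that the host should be able to run.

    Assumption: 'sse4' needs sse4_2, 'avx2' needs avx2, and both '512y' and
    '512z' need avx512vl (the feature named in the ERROR message).
    """
    builds = ["none"]
    if "sse4_2" in cpu_flags:
        builds.append("sse4")
    if "avx2" in cpu_flags:
        builds.append("avx2")
    if "avx512vl" in cpu_flags:
        builds += ["512y", "512z"]
    return builds


if __name__ == "__main__":
    # Example flag sets (illustrative, not read from the hosts in the logs)
    print(runnable_builds({"sse4_2", "avx2"}))
    print(runnable_builds({"sse4_2", "avx2", "avx512f", "avx512vl"}))
```

A throughput script could skip the builds missing from this list instead of letting them abort at startup.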
…the best)
On bmk32-cc7-rhl4dhp74r.cern.ch [CPU: Intel(R) Xeon(R) Gold 6130 CPU] [GPU: none]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.319663e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.106917 sec
real 0m7.136s
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.534639e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.812357 sec
real 0m4.841s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.685652e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.651054 sec
real 0m3.679s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.822843e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.655420 sec
real 0m3.685s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.634158e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.676052 sec
real 0m3.706s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================
NB: code compiled on itscrd70 and copied over as-is
srcdir=/afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpuBis/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
cd /data/avalassi/GPU2020/gold
mkdir -p epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
\cp $srcdir/throughput12.sh .
\cp $srcdir/simdSymSummary.sh .
for f in $srcdir/build*/check.exe $srcdir/build*/CPPProcess.o; do echo $f; d=$(basename $(dirname $f)); mkdir -p $d; \cp $f $d; done
mkdir -p ../../Cards/
\cp $srcdir/../../Cards/param_card.dat ../../Cards/param_card.dat
\cp /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/lib64/libstdc++.so.6 .
\cp /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/lib64/libgomp.so.1 .
\cp /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/lib64/libgcc_s.so.1 .
cd ../../../../../..; tar -cvf gold.tar gold; cd -
scp ../../../../../../gold.tar avalassi@bmk32-cc7-rhl4dhp74r:/home/avalassi
… perf
On bmk32-cc7-rhl4dhp74r.cern.ch [CPU: Intel(R) Xeon(R) Gold 6130 CPU] [GPU: none]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.316104e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.148966 sec
21,927,198,373 cycles # 2.754 GHz
49,099,019,022 instructions # 2.24 insn per cycle
7.177090888 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.526108e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.851482 sec
15,621,448,969 cycles # 2.757 GHz
30,607,518,127 instructions # 1.96 insn per cycle
4.879296659 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.675595e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.661612 sec
11,770,617,273 cycles # 2.637 GHz
17,226,922,814 instructions # 1.46 insn per cycle
3.689494234 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.850632e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.637422 sec
11,844,213,793 cycles # 2.641 GHz
16,812,019,185 instructions # 1.42 insn per cycle
3.665992532 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.655938e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.684760 sec
11,173,811,979 cycles # 2.485 GHz
13,674,017,211 instructions # 1.22 insn per cycle
3.712847551 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================
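To compare the SIMD builds in the Gold 6130 perf run above, the throughputs can be turned into speedups over the scalar build. The sketch below does only that arithmetic. The numbers are copied from the log; the dictionary layout and the `speedups` helper are illustrative assumptions, not part of the throughput script.

```python
# EvtsPerSec[MECalcOnly] (3a) values copied from the Gold 6130 perf run above
THROUGHPUT = {
    "none": 1.316104e+06,
    "sse4": 2.526108e+06,
    "avx2": 4.675595e+06,
    "512y": 4.850632e+06,
    "512z": 4.655938e+06,
}

# Doubles per SIMD vector, from the VECTOR[n] labels in the log
VECTOR_WIDTH = {"none": 1, "sse4": 2, "avx2": 4, "512y": 4, "512z": 8}


def speedups(throughput, baseline="none"):
    """Return throughput of each build divided by the baseline build's throughput."""
    base = throughput[baseline]
    return {build: value / base for build, value in throughput.items()}


if __name__ == "__main__":
    for build, factor in speedups(THROUGHPUT).items():
        ideal = VECTOR_WIDTH[build]
        print(f"{build:>5}: x{factor:.2f} measured, x{ideal} vector width")
```

On this host the '512z' build, with twice the vector width of 'avx2' and '512y', does not come out ahead of them.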
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.308379e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.000368 sec
7,563.42 msec task-clock # 1.079 CPUs utilized
92 context-switches # 0.012 K/sec
34 cpu-migrations # 0.004 K/sec
6,623 page-faults # 0.876 K/sec
20,206,016,940 cycles # 2.672 GHz
48,559,476,997 instructions # 2.40 insn per cycle
1,617,798,346 branches # 213.898 M/sec
40,829,916 branch-misses # 2.52% of all branches
12,737,716,160 L1-dcache-loads # 1684.122 M/sec
129,760,167 L1-dcache-load-misses # 1.02% of all L1-dcache hits
7.009693745 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.533888e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.697803 sec
5,265.27 msec task-clock # 1.119 CPUs utilized
86 context-switches # 0.016 K/sec
27 cpu-migrations # 0.005 K/sec
6,623 page-faults # 0.001 M/sec
14,055,803,585 cycles # 2.670 GHz
30,070,489,639 instructions # 2.14 insn per cycle
1,358,524,691 branches # 258.016 M/sec
40,848,534 branch-misses # 3.01% of all branches
7,557,353,591 L1-dcache-loads # 1435.321 M/sec
122,717,179 L1-dcache-load-misses # 1.62% of all L1-dcache hits
4.706821960 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.603137e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.475700 sec
4,067.94 msec task-clock # 1.167 CPUs utilized
79 context-switches # 0.019 K/sec
32 cpu-migrations # 0.008 K/sec
7,133 page-faults # 0.002 M/sec
10,306,891,498 cycles # 2.534 GHz
16,693,240,427 instructions # 1.62 insn per cycle
1,237,173,420 branches # 304.127 M/sec
40,830,545 branch-misses # 3.30% of all branches
5,176,991,254 L1-dcache-loads # 1272.631 M/sec
101,854,732 L1-dcache-load-misses # 1.97% of all L1-dcache hits
3.485010692 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.911909e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.415136 sec
4,010.12 msec task-clock # 1.171 CPUs utilized
126 context-switches # 0.031 K/sec
33 cpu-migrations # 0.008 K/sec
6,624 page-faults # 0.002 M/sec
10,189,296,065 cycles # 2.541 GHz
16,274,934,090 instructions # 1.60 insn per cycle
1,135,981,115 branches # 283.279 M/sec
40,929,587 branch-misses # 3.60% of all branches
3,722,826,658 L1-dcache-loads # 928.358 M/sec
101,817,067 L1-dcache-load-misses # 2.73% of all L1-dcache hits
3.424371340 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.727578e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.846731 sec
4,425.39 msec task-clock # 1.148 CPUs utilized
87 context-switches # 0.020 K/sec
35 cpu-migrations # 0.008 K/sec
8,157 page-faults # 0.002 M/sec
9,967,056,024 cycles # 2.252 GHz
13,145,760,375 instructions # 1.32 insn per cycle
1,080,089,749 branches # 244.066 M/sec
41,016,506 branch-misses # 3.80% of all branches
2,849,996,485 L1-dcache-loads # 644.010 M/sec
95,185,042 L1-dcache-load-misses # 3.34% of all L1-dcache hits
3.855786619 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================
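The derived columns in the perf output above (insn per cycle, L1 miss percentage) can be recomputed from the raw counters. The following is a minimal sketch of such a parser, assuming one counter per line in the `value name # comment` layout printed by perf stat. The regular expression and the `parse_perf_counters` helper are illustrative and not taken from the throughput script.

```python
import re

# Matches lines such as "20,206,016,940 cycles # 2.672 GHz".
# Lines with a unit before the name (e.g. "7,563.42 msec task-clock") are skipped.
COUNTER_RE = re.compile(r"^\s*([\d,]+)\s+([\w:-]+)\s+#", re.MULTILINE)


def parse_perf_counters(text):
    """Return {counter name: integer value} for every counter line in the text."""
    return {name: int(value.replace(",", "")) for value, name in COUNTER_RE.findall(text)}


# Counter lines copied from the scalar ('none') build on itscrd70 above
SCALAR_PERF = """
7,563.42 msec task-clock # 1.079 CPUs utilized
20,206,016,940 cycles # 2.672 GHz
48,559,476,997 instructions # 2.40 insn per cycle
12,737,716,160 L1-dcache-loads # 1684.122 M/sec
129,760,167 L1-dcache-load-misses # 1.02% of all L1-dcache hits
"""

if __name__ == "__main__":
    c = parse_perf_counters(SCALAR_PERF)
    ipc = c["instructions"] / c["cycles"]
    l1_miss = c["L1-dcache-load-misses"] / c["L1-dcache-loads"]
    print(f"insn per cycle = {ipc:.2f}, L1 load miss ratio = {100 * l1_miss:.2f}%")
```

The same helper also accepts the `cycles:u` and `instructions:u` names seen in the pmpe04 logs, since the pattern allows a colon in the counter name.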
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.237518e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.361609e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.725189 sec
2,549,736,022 cycles # 2.650 GHz
3,503,747,691 instructions # 1.37 insn per cycle
1.023542621 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,444 [-p 2048 256 1]
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[1] == AOS [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.219148e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351486e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371810e-02 +- 3.270123e-06 ) GeV^0
TOTAL : 1.185026 sec
3,343,282,112 cycles # 2.645 GHz
4,786,320,070 instructions # 1.43 insn per cycle
1.484666756 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 40 [-p 1 4 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,244,176 [-p 2048 256 1]
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[1] == AOS [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[1] == AOS
EvtsPerSec[MatrixElems] (3) = ( 6.972024e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.266853e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371810e-02 +- 3.270123e-06 ) GeV^0
TOTAL : 1.128006 sec
3,365,367,187 cycles # 2.651 GHz
4,800,781,439 instructions # 1.43 insn per cycle
1.425973157 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 40 [-p 1 1 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 160 [-p 1 4 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 1,280 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 20,971,007 [-p 2048 256 1]
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[1] == AOS
EvtsPerSec[MatrixElems] (3) = ( 6.997163e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.272792e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.208810 sec
3,334,791,948 cycles # 2.652 GHz
4,801,124,639 instructions # 1.44 insn per cycle
1.502558668 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 1,280 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 20,971,817 [-p 2048 256 1]
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[8]
EvtsPerSec[MatrixElems] (3) = ( 7.230897e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.349811e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.161124 sec
3,327,628,044 cycles # 2.650 GHz
4,775,011,660 instructions # 1.43 insn per cycle
1.455159065 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,514 [-p 2048 256 1]
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.227665e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.352993e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.119368 sec
3,329,955,135 cycles # 2.649 GHz
4,782,880,833 instructions # 1.44 insn per cycle
1.414098430 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,035 [-p 2048 256 1]
=========================================================================
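The sector counts in the profiles above can be reproduced with a simple addressing model. The sketch below counts the distinct 32-byte sectors touched when consecutive events each load one double from the momenta array. It assumes the array is indexed as `momenta[ipag][ipar][ip4][iepp]` with `neppM` events per page, that the base address is sector-aligned, and that a sector is 32 bytes. These are assumptions of this illustration, not statements taken from the code in the PR.

```python
SECTOR_BYTES = 32  # assumed memory sector size
DOUBLE_BYTES = 8   # FP precision = DOUBLE in the logs
NPAR, NP4 = 4, 4   # particles per event, components per four-momentum


def momenta_offset_bytes(ievt, ipar, ip4, neppm):
    """Byte offset of one double in an AOSOA momenta[ipag][ipar][ip4][iepp] array."""
    ipag, iepp = divmod(ievt, neppm)
    index = ((ipag * NPAR + ipar) * NP4 + ip4) * neppm + iepp
    return index * DOUBLE_BYTES


def sectors_per_request(nthreads, neppm, ipar=0, ip4=0):
    """Distinct sectors touched when nthreads consecutive events load the same component."""
    sectors = {
        momenta_offset_bytes(ievt, ipar, ip4, neppm) // SECTOR_BYTES
        for ievt in range(nthreads)
    }
    return len(sectors)


if __name__ == "__main__":
    for neppm in (1, 4, 8):
        counts = [sectors_per_request(n, neppm) for n in (8, 32)]
        print(f"AOSOA[{neppm}]: sectors per request for 8 and 32 threads = {counts}")
```

In this model AOSOA[4] and AOSOA[8] both give 8 sectors per request for a full warp of 32 threads, against 32 for AOS. That agrees with the measured ratios of sectors to requests above (320/40 and 1,280/40).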
This is because the kernel needs the four-momenta of all four particles, times four helicities (64), instead of two four-momenta and two pz momenta, times four helicities (40).
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 5.129781e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 7.643036e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.223921 sec
3,081,196,662 cycles # 2.651 GHz
4,394,808,734 instructions # 1.43 insn per cycle
1.521394982 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 136
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 64 sectors 128 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 64 sectors 512 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 1,048,576 sectors 8,388,608 [-p 2048 256 1]
=========================================================================
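The request counts quoted above (64 versus 40) follow from counting one load per momentum component per helicity. The sketch below spells out that arithmetic and scales it to a whole grid. It assumes the profiler counts one request per warp-level load and that a warp has 32 threads; the function names are hypothetical.

```python
WARP_SIZE = 32  # assumed number of threads per warp


def requests_per_warp(n_four_momenta, n_pz_only, n_helicities):
    """Loads issued by one warp: 4 components per four-momentum, 1 per pz-only particle."""
    return (4 * n_four_momenta + n_pz_only) * n_helicities


def total_requests(blocks, threads_per_block, per_warp):
    """Requests for a whole grid, rounding partial warps up to one warp."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    return blocks * warps_per_block * per_warp


if __name__ == "__main__":
    full = requests_per_warp(n_four_momenta=4, n_pz_only=0, n_helicities=4)
    reduced = requests_per_warp(n_four_momenta=2, n_pz_only=2, n_helicities=4)
    print(full, reduced)
    # Grid used for the large runs in the logs: -p 2048 256 1
    print(total_requests(2048, 256, full), total_requests(2048, 256, reduced))
```

With these assumptions the `-p 2048 256 1` grid gives 1,048,576 requests for the full four-momenta and 655,360 for the reduced set, which are the values reported by the profiler in the logs.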
This is the result of a TEST with a single helicity:
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 1.081427e+09 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.581359e+09 ) sec^-1
MeanMatrixElemValue = ( 4.963658e-03 +- 1.770609e-06 ) GeV^0
TOTAL : 0.930117 sec
2,957,532,161 cycles # 2.651 GHz
4,187,693,693 instructions # 1.42 insn per cycle
1.227039609 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 98
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 10 sectors 20 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 10 sectors 80 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 163,840 sectors 1,310,720 [-p 2048 256 1]
=========================================================================
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.043865e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351422e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.723324 sec
2,546,454,804 cycles # 2.656 GHz
3,488,166,591 instructions # 1.37 insn per cycle
1.020129392 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,242,606 [-p 2048 256 1]
=========================================================================
Will self-merge.
A few additional tests while preparing the vCHEP paper.