Improve throughput script + Results of further AOSOA tests #209

Merged: 18 commits merged on Jun 11, 2021

@valassi (Member) commented Jun 11, 2021

A few additional tests while preparing the vCHEP paper.

valassi added 18 commits May 31, 2021 10:45
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.267203e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.364154e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.722013 sec
     2,528,979,124      cycles                    #    2.652 GHz
     3,490,457,956      instructions              #    1.38  insn per cycle
       1.013854254 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.309556e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     6.971825 sec
    20,204,537,861      cycles                    #    2.674 GHz
    48,560,277,826      instructions              #    2.40  insn per cycle
       6.979984273 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.534958e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.659099 sec
    14,025,995,113      cycles                    #    2.672 GHz
    30,075,975,631      instructions              #    2.14  insn per cycle
       4.667759564 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.611130e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.469995 sec
    10,299,552,854      cycles                    #    2.536 GHz
    16,693,116,256      instructions              #    1.62  insn per cycle
       3.478566716 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.925171e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.400321 sec
    10,168,815,414      cycles                    #    2.542 GHz
    16,278,373,573      instructions              #    1.60  insn per cycle
       3.408559417 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.726243e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.828006 sec
     9,961,706,306      cycles                    #    2.250 GHz
    13,142,521,259      instructions              #    1.32  insn per cycle
       3.836258066 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
=========================================================================
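
The relative SIMD speedups can be read off the `EvtsPerSec[MECalcOnly]` lines above. The following is a minimal sketch, not part of the throughput script: the function name `simd_speedups` and the log file name in the usage note are hypothetical, and the parsing assumes the exact line layout shown in these logs.

```shell
# Sketch: print each SIMD build tag and its MECalcOnly throughput relative to
# the first build found in the log (the scalar 'none' build in the logs above).
# Reads a throughput log on stdin; the function name is hypothetical.
simd_speedups() {
  awk '
    # Remember the build tag quoted on the "Internal loops" line, e.g. sse4.
    /^Internal loops fptype_sv/ { split($0, q, "\047"); tag = q[2] }
    # Field 5 of the EvtsPerSec[MECalcOnly] line is the throughput value.
    # Blocks without an "Internal loops" line (the CUDA ones) are skipped.
    /^EvtsPerSec\[MECalcOnly\]/ && tag != "" {
      if (base == 0) base = $5 + 0
      printf "%s %.2f\n", tag, $5 / base
      tag = ""
    }
  '
}
```

With the itscrd70 block above saved to a file (say `itscrd70.log`, a made-up name), `simd_speedups < itscrd70.log` should give approximately 1.94 for 'sse4', 3.52 for 'avx2', 3.76 for '512y' and 2.85 for '512z', which makes it easy to see that on this Silver 4216 the '512z' build is slower than 'avx2'.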

On pmpe04.cern.ch [CPU: Intel(R) Xeon(R) CPU E5-2630 v3] [GPU: none]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.305101e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     6.830661 sec
    21,856,861,877      cycles:u                  #    2.811 GHz
    48,486,844,412      instructions:u            #    2.22  insn per cycle
       6.838002353 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.234273e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.877238 sec
    15,542,989,077      cycles:u                  #    2.668 GHz
    30,005,506,418      instructions:u            #    1.93  insn per cycle
       4.884807714 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.039891e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.603382 sec
    11,649,111,915      cycles:u                  #    2.551 GHz
    16,629,656,104      instructions:u            #    1.43  insn per cycle
       3.610842924 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
ERROR! The application is built for skylake-avx512 (AVX512VL) but the host does not support it
         2,155,744      cycles:u                  #    0.697 GHz
         2,347,155      instructions:u            #    1.09  insn per cycle
       0.005606592 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
ERROR! The application is built for skylake-avx512 (AVX512VL) but the host does not support it
         1,984,218      cycles:u                  #    0.367 GHz
         2,347,081      instructions:u            #    1.18  insn per cycle
       0.007985069 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
=========================================================================
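
The two ERROR lines above come from running the '512y' and '512z' builds on a host without AVX512VL. A pre-flight check along the following lines could skip those builds instead of failing. This is only a sketch: the function names and the tag-to-flag mapping are assumptions (the mapping for '512y' and '512z' is taken from the AVX512VL error message), and the flag names are those used by Linux in `/proc/cpuinfo`.

```shell
# Return success if the space-separated CPU flag list in $1 contains flag $2.
has_cpu_flag() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Map a build tag from the logs to the cpuinfo flag it needs (assumed mapping).
required_flag() {
  case "$1" in
    none) echo "" ;;
    sse4) echo sse4_2 ;;
    avx2) echo avx2 ;;
    512y|512z) echo avx512vl ;;
    *) return 1 ;;
  esac
}

# $1 = build tag, $2 = CPU flag list. Returns 0 if the build can run,
# 1 if the host lacks the flag, 2 if the tag is unknown.
can_run_build() {
  flag=$(required_flag "$1") || return 2
  [ -z "$flag" ] || has_cpu_flag "$2" "$flag"
}

# Example use on a Linux host:
#   flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
#   can_run_build 512y "$flags" || echo "skipping 512y on this host"
```
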
…the best)

On bmk32-cc7-rhl4dhp74r.cern.ch [CPU: Intel(R) Xeon(R) Gold 6130 CPU] [GPU: none]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.319663e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.106917 sec
real    0m7.136s
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.534639e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.812357 sec
real    0m4.841s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.685652e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.651054 sec
real    0m3.679s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.822843e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.655420 sec
real    0m3.685s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.634158e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.676052 sec
real    0m3.706s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
=========================================================================

NB: code compiled on itscrd70 and copied over as-is
srcdir=/afs/cern.ch/user/a/avalassi/GPU2020/madgraph4gpuBis/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
cd /data/avalassi/GPU2020/gold
mkdir -p epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
cd epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum
\cp $srcdir/throughput12.sh .
\cp $srcdir/simdSymSummary.sh .
for f in $srcdir/build*/check.exe $srcdir/build*/CPPProcess.o; do echo $f; d=$(basename $(dirname $f)); mkdir -p $d; \cp $f $d; done
mkdir -p ../../Cards/
\cp $srcdir/../../Cards/param_card.dat ../../Cards/param_card.dat
\cp /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/lib64/libstdc++.so.6 .
\cp /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/lib64/libgomp.so.1 .
\cp /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/lib64/libgcc_s.so.1 .
cd ../../../../../..; tar -cvf gold.tar gold; cd -
scp ../../../../../../gold.tar avalassi@bmk32-cc7-rhl4dhp74r:/home/avalassi
… perf

On bmk32-cc7-rhl4dhp74r.cern.ch [CPU: Intel(R) Xeon(R) Gold 6130 CPU] [GPU: none]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.316104e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.148966 sec
    21,927,198,373      cycles                    #    2.754 GHz
    49,099,019,022      instructions              #    2.24  insn per cycle
       7.177090888 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.526108e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.851482 sec
    15,621,448,969      cycles                    #    2.757 GHz
    30,607,518,127      instructions              #    1.96  insn per cycle
       4.879296659 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.675595e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.661612 sec
    11,770,617,273      cycles                    #    2.637 GHz
    17,226,922,814      instructions              #    1.46  insn per cycle
       3.689494234 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.850632e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.637422 sec
    11,844,213,793      cycles                    #    2.641 GHz
    16,812,019,185      instructions              #    1.42  insn per cycle
       3.665992532 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 4.655938e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.684760 sec
    11,173,811,979      cycles                    #    2.485 GHz
    13,674,017,211      instructions              #    1.22  insn per cycle
       3.712847551 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
=========================================================================
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.308379e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.000368 sec
          7,563.42 msec task-clock                #    1.079 CPUs utilized
                92      context-switches          #    0.012 K/sec
                34      cpu-migrations            #    0.004 K/sec
             6,623      page-faults               #    0.876 K/sec
    20,206,016,940      cycles                    #    2.672 GHz
    48,559,476,997      instructions              #    2.40  insn per cycle
     1,617,798,346      branches                  #  213.898 M/sec
        40,829,916      branch-misses             #    2.52% of all branches
    12,737,716,160      L1-dcache-loads           # 1684.122 M/sec
       129,760,167      L1-dcache-load-misses     #    1.02% of all L1-dcache hits
       7.009693745 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.533888e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.697803 sec
          5,265.27 msec task-clock                #    1.119 CPUs utilized
                86      context-switches          #    0.016 K/sec
                27      cpu-migrations            #    0.005 K/sec
             6,623      page-faults               #    0.001 M/sec
    14,055,803,585      cycles                    #    2.670 GHz
    30,070,489,639      instructions              #    2.14  insn per cycle
     1,358,524,691      branches                  #  258.016 M/sec
        40,848,534      branch-misses             #    3.01% of all branches
     7,557,353,591      L1-dcache-loads           # 1435.321 M/sec
       122,717,179      L1-dcache-load-misses     #    1.62% of all L1-dcache hits
       4.706821960 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.603137e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.475700 sec
          4,067.94 msec task-clock                #    1.167 CPUs utilized
                79      context-switches          #    0.019 K/sec
                32      cpu-migrations            #    0.008 K/sec
             7,133      page-faults               #    0.002 M/sec
    10,306,891,498      cycles                    #    2.534 GHz
    16,693,240,427      instructions              #    1.62  insn per cycle
     1,237,173,420      branches                  #  304.127 M/sec
        40,830,545      branch-misses             #    3.30% of all branches
     5,176,991,254      L1-dcache-loads           # 1272.631 M/sec
       101,854,732      L1-dcache-load-misses     #    1.97% of all L1-dcache hits
       3.485010692 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.911909e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.415136 sec
          4,010.12 msec task-clock                #    1.171 CPUs utilized
               126      context-switches          #    0.031 K/sec
                33      cpu-migrations            #    0.008 K/sec
             6,624      page-faults               #    0.002 M/sec
    10,189,296,065      cycles                    #    2.541 GHz
    16,274,934,090      instructions              #    1.60  insn per cycle
     1,135,981,115      branches                  #  283.279 M/sec
        40,929,587      branch-misses             #    3.60% of all branches
     3,722,826,658      L1-dcache-loads           #  928.358 M/sec
       101,817,067      L1-dcache-load-misses     #    2.73% of all L1-dcache hits
       3.424371340 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.727578e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.846731 sec
          4,425.39 msec task-clock                #    1.148 CPUs utilized
                87      context-switches          #    0.020 K/sec
                35      cpu-migrations            #    0.008 K/sec
             8,157      page-faults               #    0.002 M/sec
     9,967,056,024      cycles                    #    2.252 GHz
    13,145,760,375      instructions              #    1.32  insn per cycle
     1,080,089,749      branches                  #  244.066 M/sec
        41,016,506      branch-misses             #    3.80% of all branches
     2,849,996,485      L1-dcache-loads           #  644.010 M/sec
        95,185,042      L1-dcache-load-misses     #    3.34% of all L1-dcache hits
       3.855786619 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
=========================================================================
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.237518e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.361609e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.725189 sec
     2,549,736,022      cycles                    #    2.650 GHz
     3,503,747,691      instructions              #    1.37  insn per cycle
       1.023542621 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,444 [-p 2048 256 1]
=========================================================================
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[1] == AOS [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.219148e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351486e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371810e-02 +- 3.270123e-06 )  GeV^0
TOTAL       :     1.185026 sec
     3,343,282,112      cycles                    #    2.645 GHz
     4,786,320,070      instructions              #    1.43  insn per cycle
       1.484666756 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 40 [-p 1 4 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,244,176 [-p 2048 256 1]
=========================================================================
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[1] == AOS [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[1] == AOS
EvtsPerSec[MatrixElems] (3) = ( 6.972024e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.266853e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371810e-02 +- 3.270123e-06 )  GeV^0
TOTAL       :     1.128006 sec
     3,365,367,187      cycles                    #    2.651 GHz
     4,800,781,439      instructions              #    1.43  insn per cycle
       1.425973157 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 40 [-p 1 1 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 160 [-p 1 4 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 1,280 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 20,971,007 [-p 2048 256 1]
=========================================================================
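
The sector counts in the profiler lines above are consistent with a simple coalescing model: a sector is 32 bytes, so it holds four doubles, and with an AOSOA[N] momenta layout up to min(N, 4) neighbouring events share one sector, while with AOS every thread touches its own sector. The sketch below encodes that model. The function names are hypothetical, and it assumes sector-aligned arrays and an AOS stride between events of at least 32 bytes.

```shell
# Sketch: minimum 32-byte sectors touched when each active thread of a warp
# loads one double (8 bytes) of the same momentum component for its own event.
# $1 = active threads in the warp, $2 = AOSOA inner size (1 means AOS).
sectors_per_request() {
  events_per_sector=$2
  if [ "$events_per_sector" -gt 4 ]; then
    events_per_sector=4   # a 32-byte sector holds at most 4 doubles
  fi
  # Ceiling division: partially filled sectors still count as one sector.
  echo $(( ($1 + events_per_sector - 1) / events_per_sector ))
}

# $1 = requests, $2 = active threads per warp, $3 = AOSOA inner size.
expected_sectors() {
  echo $(( $1 * $(sectors_per_request "$2" "$3") ))
}

expected_sectors 40 32 4   # AOSOA[4] momenta, [-p 1 32 1]
expected_sectors 40 32 1   # AOS momenta,      [-p 1 32 1]
```

The two calls print 320 and 1280, matching the `[-p 1 32 1]` lines for the AOSOA[4] and AOS momenta layouts. For the full `[-p 2048 256 1]` grid the model gives 655,360 x 8 = 5,242,880 and 655,360 x 32 = 20,971,520 sectors, close to but not exactly the measured values.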
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[1] == AOS
EvtsPerSec[MatrixElems] (3) = ( 6.997163e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.272792e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.208810 sec
     3,334,791,948      cycles                    #    2.652 GHz
     4,801,124,639      instructions              #    1.44  insn per cycle
       1.502558668 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 1,280 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 20,971,817 [-p 2048 256 1]
=========================================================================
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[8]
EvtsPerSec[MatrixElems] (3) = ( 7.230897e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.349811e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.161124 sec
     3,327,628,044      cycles                    #    2.650 GHz
     4,775,011,660      instructions              #    1.43  insn per cycle
       1.455159065 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,514 [-p 2048 256 1]
=========================================================================
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.227665e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.352993e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.119368 sec
     3,329,955,135      cycles                    #    2.649 GHz
     4,782,880,833      instructions              #    1.44  insn per cycle
       1.414098430 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,035 [-p 2048 256 1]
=========================================================================
This is because the kernel needs the four-momenta of all four particles, times four helicities (64 requests),
instead of two four-momenta and two pz momenta, times four helicities (40 requests).

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 5.129781e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 7.643036e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.223921 sec
     3,081,196,662      cycles                    #    2.651 GHz
     4,394,808,734      instructions              #    1.43  insn per cycle
       1.521394982 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 136
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 64 sectors 128 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 64 sectors 512 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 1,048,576 sectors 8,388,608 [-p 2048 256 1]
=========================================================================
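The request and sector counts in the profiles above can be cross-checked with a small model. The sketch below is not part of the PR. It assumes that `-p` takes blocks, threads per block and iterations, that each thread issues one 8-byte (double) load per momentum component per helicity, and that loads are fully coalesced into 32-byte sectors. The function names are hypothetical.

```python
WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
SECTOR_BYTES = 32   # size of one memory sector
DOUBLE_BYTES = 8    # the logs report FP precision = DOUBLE


def loads_per_thread(n_full_momenta, n_pz_only, n_helicities):
    """Momentum components read by one thread (assumed: one load per component per helicity)."""
    return (4 * n_full_momenta + n_pz_only) * n_helicities


def expected_traffic(blocks, threads, loads):
    """Estimate (requests, sectors) for one kernel launch.

    Simplification: threads per block is either <= 32 or a multiple of 32,
    and every request is perfectly coalesced.
    """
    warps_per_block = -(-threads // WARP_SIZE)      # ceiling division
    active_lanes = min(threads, WARP_SIZE)
    requests = blocks * warps_per_block * loads
    sectors_per_request = active_lanes * DOUBLE_BYTES // SECTOR_BYTES
    return requests, requests * sectors_per_request


all_momenta = loads_per_thread(4, 0, 4)   # four four-momenta, four helicities
reduced = loads_per_thread(2, 2, 4)       # two four-momenta plus two pz, four helicities

for grid in [(1, 8), (1, 32), (2048, 256)]:
    print(grid, expected_traffic(*grid, all_momenta), expected_traffic(*grid, reduced))
```

Under these assumptions the model reproduces the 64-request profile exactly (64/128, 64/512 and 1,048,576/8,388,608). For the 40-request runs it gives 5,242,880 sectors on the large grid, while the measured values (for example 5,243,035) differ by a few hundred sectors.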
This is the result of a TEST with a single helicity:

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 1.081427e+09                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.581359e+09                 )  sec^-1
MeanMatrixElemValue         = ( 4.963658e-03 +- 1.770609e-06 )  GeV^0
TOTAL       :     0.930117 sec
     2,957,532,161      cycles                    #    2.651 GHz
     4,187,693,693      instructions              #    1.42  insn per cycle
       1.227039609 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 98
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 10 sectors 20 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 10 sectors 80 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 163,840 sectors 1,310,720 [-p 2048 256 1]
=========================================================================
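To put the single-helicity test next to the four-helicity runs, the short sketch below computes throughput ratios from the `EvtsPerSec[MECalcOnly]` values and compares the per-warp request counts reported in the profiles. It is only illustrative arithmetic on numbers copied from the logs above. The variable names are hypothetical, and the single-helicity run computes a different mean matrix element, so the ratios are indicative at best.

```python
# EvtsPerSec[MECalcOnly] values copied from the AOSOA[4] logs above
baseline_4hel = 1.352993e9      # 40 requests per warp, four helicities
all_momenta_4hel = 7.643036e8   # 64 requests per warp, four helicities
single_hel = 3.581359e9         # 10 requests per warp, single-helicity TEST

# Requests per warp reported by the profiler for [-p 1 32 1]
requests = {"baseline_4hel": 40, "all_momenta_4hel": 64, "single_hel": 10}


def ratio(a, b):
    """Plain ratio a / b, used for both throughputs and request counts."""
    return a / b


speedup_single = ratio(single_hel, baseline_4hel)
slowdown_all = ratio(all_momenta_4hel, baseline_4hel)

print(f"single helicity vs baseline: {speedup_single:.3f}x throughput, "
      f"{ratio(requests['single_hel'], requests['baseline_4hel']):.2f}x requests")
print(f"all momenta vs baseline:     {slowdown_all:.3f}x throughput, "
      f"{ratio(requests['all_momenta_4hel'], requests['baseline_4hel']):.2f}x requests")
```

The request count scales with the number of helicities (10 for one, 40 for four), whereas the measured throughput does not scale by the same factor.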


On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.043865e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351422e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.723324 sec
     2,546,454,804      cycles                    #    2.656 GHz
     3,488,166,591      instructions              #    1.37  insn per cycle
       1.020129392 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,242,606 [-p 2048 256 1]
=========================================================================
Member Author

valassi commented Jun 11, 2021

Will self merge
