Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[klas3] vectorize the final calcualations in sigmakin (issue #178)
This squeezes one (last?) % of performance. Now 4.91E6, was 4.87E6. On itscrd70.cern.ch: ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0] FP precision = DOUBLE (NaN/abnormal=0, zero=0) Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 1.307604e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 7.176768 sec real 0m7.186s =Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0) ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221] FP precision = DOUBLE (NaN/abnormal=0, zero=0) EvtsPerSec[MatrixElems] (3) = ( 7.341151e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.731415 sec real 0m1.031s ==PROF== Profiling "sigmaKin": launch__registers_per_thread 120 ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0] FP precision = DOUBLE (NaN/abnormal=0, zero=0) Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES] OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 2.500771e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 4.879965 sec real 0m4.890s =Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0) ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0] FP precision = DOUBLE (NaN/abnormal=0, zero=0) Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES] OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 4.600406e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 3.653161 sec real 0m3.663s =Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0) ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0] FP precision = DOUBLE (NaN/abnormal=0, zero=0) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES] OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 4.909408e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 3.581065 sec real 0m3.591s =Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0) ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0] FP precision = DOUBLE (NaN/abnormal=0, zero=0) Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES] OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 3.736380e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 3.995692 sec real 0m4.006s =Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045) ------------------------------------------------------------------------- Process = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0] FP precision = DOUBLE (NaN/abnormal=0, zero=0) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 1.145752e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 7.854923 sec real 0m7.865s =Symbols in CPPProcess.o= (~sse4: 567) (avx2: 0) (512y: 0) (512z: 0) ------------------------------------------------------------------------- Process = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221] FP precision = DOUBLE (NaN/abnormal=0, zero=0) EvtsPerSec[MatrixElems] (3) = ( 7.451168e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.734622 sec real 0m1.027s ==PROF== Profiling "sigmaKin": launch__registers_per_thread 164 -------------------------------------------------------------------------
- Loading branch information