klas2 (SIMD CPU) + epoch1/epoch2 #152

Closed
wants to merge 314 commits into from


@valassi valassi commented Apr 2, 2021

This PR merges klas2 #132 (which replaced klas #72) with epoch12 #151.

It supersedes #72 and #132. Opening it as WIP.

…2 build flag.

The build uses zmm registers heavily, but AVX512 shows no sign of a speedup over AVX2 (I am on a Skylake).

objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm|zmm'  | wc -l
1356
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm'  | wc -l
452
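
The two counts above differ only by the `zmm` alternative in the pattern, so their difference (1356 - 452 = 904) is the number of disassembly lines that match solely because they mention a zmm register. This is a lower bound on zmm usage, since a line such as `vaddpd` on zmm operands matches both patterns. A minimal Python sketch of the same counting, assuming the disassembly text is already available as a string; the three-line `sample` is hypothetical and not taken from check.exe:

```python
import re

# Patterns copied from the egrep commands above.
WIDE = re.compile(r"vaddpd|vmul|vfmadd132pd|ymm|zmm")
NARROW = re.compile(r"vaddpd|vmul|vfmadd132pd|ymm")


def count_matching_lines(disassembly: str, pattern: re.Pattern) -> int:
    """Count lines containing a match, like `egrep PATTERN | wc -l`."""
    return sum(1 for line in disassembly.splitlines() if pattern.search(line))


def zmm_only_lines(disassembly: str) -> int:
    """Lines that match WIDE but not NARROW, i.e. only through 'zmm'."""
    return count_matching_lines(disassembly, WIDE) - count_matching_lines(
        disassembly, NARROW
    )


# Hypothetical excerpt for illustration only.
sample = "\n".join(
    [
        "vmovapd (%rax),%zmm0",      # matches only through zmm
        "vaddpd %ymm1,%ymm2,%ymm3",  # matches both patterns
        "mov %rsp,%rbp",             # matches neither
    ]
)
print(zmm_only_lines(sample))
```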

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[8]
Momenta memory layout      = AOSOA[8]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.819619e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.544286e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.753339e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.308055e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.613480e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.613480e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.613480e-01 ,  2.613480e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.372618e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.479249e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.006091e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372469e-02 +- 1.132952e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374903e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.203450e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000263 sec
0b MemAlloc :     0.027018 sec
0c GenCreat :     0.000858 sec
0d SGoodHel :     0.000145 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027525 sec
2a RamboIni :     0.007915 sec
2b RamboFin :     0.085166 sec
3a SigmaKin :     0.261348 sec
4a DumpLoop :     0.004633 sec
8a CompStat :     0.003537 sec
9a GenDestr :     0.000071 sec
9b DumpScrn :     0.000177 sec
9c DumpJson :     0.000008 sec
TOTAL       :     0.418671 sec
TOTAL (123) :     0.381962 sec
TOTAL  (23) :     0.354429 sec
TOTAL   (1) :     0.027533 sec
TOTAL   (2) :     0.093081 sec
TOTAL   (3) :     0.261348 sec
***********************************************************************
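
The throughput lines in the log above are consistent with events divided by elapsed time, where the event count is the product of the three command-line values. A small sketch of that arithmetic, assuming the arguments of `-p 16384 32 1` map to blocks, threads and iterations as the log header suggests; the function names are illustrative, not taken from the code base:

```python
def events_computed(blocks: int, threads: int, iterations: int) -> int:
    """Total events, assuming one event per (block, thread, iteration)."""
    return blocks * threads * iterations


def throughput(events: int, seconds: float) -> float:
    """Events per second for a given elapsed time."""
    return events / seconds


events = events_computed(16384, 32, 1)
# TotalTime[MatrixElems] (3) from the log above.
me_rate = throughput(events, 2.613480e-01)
print(events, f"{me_rate:.6e}")
```

The same function applied to the (23) and (123) times gives the other two EvtsPerSec lines.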
…numbers as GPU.

objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm|zmm'  | wc -l
1247
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm'  | wc -l
1247

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.708773e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.432945e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.758284e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.129624e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.519982e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.519982e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.519982e-01 ,  2.519982e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.413643e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.527225e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.080523e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000286 sec
0b MemAlloc :     0.026870 sec
0c GenCreat :     0.000814 sec
0d SGoodHel :     0.000046 sec
1a GenSeed  :     0.000008 sec
1b GenRnGen :     0.027575 sec
2a RamboIni :     0.006751 sec
2b RamboFin :     0.084545 sec
3a SigmaKin :     0.251998 sec
4a DumpLoop :     0.004530 sec
8a CompStat :     0.003526 sec
9a GenDestr :     0.000116 sec
9b DumpScrn :     0.000167 sec
9c DumpJson :     0.000008 sec
TOTAL       :     0.407241 sec
TOTAL (123) :     0.370877 sec
TOTAL  (23) :     0.343294 sec
TOTAL   (1) :     0.027583 sec
TOTAL   (2) :     0.091296 sec
TOTAL   (3) :     0.251998 sec
***********************************************************************

./gcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.136103e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.523176e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.129270e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.734249e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.889270e-04                 )  sec
MeanTimeInMatrixElems      = ( 7.889270e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.889270e-04 ,  7.889270e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.346979e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 8.037312e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.645583e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     1.067698 sec
0a ProcInit :     0.000329 sec
0b MemAlloc :     0.035648 sec
0c GenCreat :     0.010443 sec
0d SGoodHel :     0.001837 sec
1a GenSeed  :     0.000011 sec
1b GenRnGen :     0.000602 sec
2a RamboIni :     0.000018 sec
2b RamboFin :     0.000011 sec
2c CpDTHwgt :     0.000506 sec
2d CpDTHmom :     0.005198 sec
3a SigmaKin :     0.000014 sec
3b CpDTHmes :     0.000775 sec
4a DumpLoop :     0.004293 sec
8a CompStat :     0.003611 sec
9a GenDestr :     0.000051 sec
9b DumpScrn :     0.000157 sec
9c DumpJson :     0.000007 sec
TOTAL       :     1.131209 sec
TOTAL (123) :     0.007136 sec
TOTAL  (23) :     0.006523 sec
TOTAL   (1) :     0.000613 sec
TOTAL   (2) :     0.005734 sec
TOTAL   (3) :     0.000789 sec
***********************************************************************
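
The `TOTAL (1)`, `TOTAL (2)` and `TOTAL (3)` lines above appear to be the sums of the timers whose key starts with that digit. The sketch below checks this on the GPU timers, assuming the leading digit is the grouping key (an inference from the log, not confirmed from the code):

```python
from collections import defaultdict

# Timer lines copied from the gcheck.exe log above.
timer_lines = """\
1a GenSeed  :     0.000011 sec
1b GenRnGen :     0.000602 sec
2a RamboIni :     0.000018 sec
2b RamboFin :     0.000011 sec
2c CpDTHwgt :     0.000506 sec
2d CpDTHmom :     0.005198 sec
3a SigmaKin :     0.000014 sec
3b CpDTHmes :     0.000775 sec
"""


def group_totals(text: str) -> dict:
    """Sum timer values by the first character of the timer key."""
    totals = defaultdict(float)
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if not rest:
            continue  # not a "key : value sec" line
        group = key.split()[0][0]
        totals[group] += float(rest.split()[0])
    return dict(totals)


totals = group_totals(timer_lines)
for group in sorted(totals):
    print(group, f"{totals[group]:.6f}")
```

The sums agree with the TOTAL lines to within the rounding of the printed values. In group 3, almost all of the time (0.000775 of 0.000789 sec) sits in `3b CpDTHmes` rather than `3a SigmaKin`, which is worth keeping in mind when comparing the GPU and CPU matrix element throughputs.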
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm'  | wc -l
0
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm|xmm'  | wc -l
2932

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[2]
Momenta memory layout      = AOSOA[2]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.923828e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.646642e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.771861e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.897543e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.656888e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.656888e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.656888e-01 ,  2.656888e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.336164e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.437728e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.973316e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372411e-02 +- 1.132746e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374897e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.201954e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000258 sec
0b MemAlloc :     0.027183 sec
0c GenCreat :     0.000877 sec
0d SGoodHel :     0.000036 sec
1a GenSeed  :     0.000011 sec
1b GenRnGen :     0.027708 sec
2a RamboIni :     0.006584 sec
2b RamboFin :     0.092392 sec
3a SigmaKin :     0.265689 sec
4a DumpLoop :     0.004384 sec
8a CompStat :     0.003518 sec
9a GenDestr :     0.000080 sec
9b DumpScrn :     0.000224 sec
9c DumpJson :     0.000008 sec
TOTAL       :     0.428950 sec
TOTAL (123) :     0.392383 sec
TOTAL  (23) :     0.364664 sec
TOTAL   (1) :     0.027719 sec
TOTAL   (2) :     0.098975 sec
TOTAL   (3) :     0.265689 sec
***********************************************************************
…ughput down a factor 3.

Note that throughput is still better than it was a week ago. Why?

objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm'  | wc -l
0
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm|xmm'  | wc -l
2908

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[1] == AOS
Momenta memory layout      = AOSOA[1] == AOS
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.240392e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.965008e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.753842e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.721096e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 5.992898e-01                 )  sec
MeanTimeInMatrixElems      = ( 5.992898e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 5.992898e-01 ,  5.992898e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.241155e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.527457e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 8.748488e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372323e-02 +- 1.131684e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.194264e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000219 sec
0b MemAlloc :     0.026940 sec
0c GenCreat :     0.000870 sec
0d SGoodHel :     0.000075 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027529 sec
2a RamboIni :     0.005977 sec
2b RamboFin :     0.091234 sec
3a SigmaKin :     0.599290 sec
4a DumpLoop :     0.004344 sec
8a CompStat :     0.003567 sec
9a GenDestr :     0.000077 sec
9b DumpScrn :     0.000177 sec
9c DumpJson :     0.000007 sec
TOTAL       :     0.760314 sec
TOTAL (123) :     0.724039 sec
TOTAL  (23) :     0.696501 sec
TOTAL   (1) :     0.027538 sec
TOTAL   (2) :     0.097211 sec
TOTAL   (3) :     0.599290 sec
***********************************************************************
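
The log above prints `AOSOA[1] == AOS`. A minimal sketch of why a page size of 1 reduces an AOSOA layout to a plain AOS, assuming the dimension order [page][particle][component][event-in-page]; the order and the names `npar`, `np4` and `neppv` are assumptions for illustration and may differ from the actual code:

```python
def aosoa_index(ievt, ipar, ip4, npar, np4, neppv):
    """Flat index for an assumed [ipag][ipar][ip4][ieppv] layout."""
    ipag, ieppv = divmod(ievt, neppv)  # page number and position in page
    return ((ipag * npar + ipar) * np4 + ip4) * neppv + ieppv


def aos_index(ievt, ipar, ip4, npar, np4):
    """Flat index for a plain [ievt][ipar][ip4] array of structures."""
    return (ievt * npar + ipar) * np4 + ip4


# Hypothetical sizes: 4 particles with 4 momentum components each.
NPAR, NP4 = 4, 4

# With one event per page, both layouts give the same index.
same = all(
    aosoa_index(e, p, c, NPAR, NP4, 1) == aos_index(e, p, c, NPAR, NP4)
    for e in range(8)
    for p in range(NPAR)
    for c in range(NP4)
)
print(same)

# With 4 events per page, the same component of neighbouring events
# in a page is contiguous, which is what a SIMD load needs.
stride = aosoa_index(1, 0, 0, NPAR, NP4, 4) - aosoa_index(0, 0, 0, NPAR, NP4, 4)
print(stride)
```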
…changes)

Actually faster now??

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[1] == AOS
Momenta memory layout      = AOSOA[1] == AOS
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 6.550682e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.274499e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.761832e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.700339e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 5.304465e-01                 )  sec
MeanTimeInMatrixElems      = ( 5.304465e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 5.304465e-01 ,  5.304465e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 8.003564e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 8.355855e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 9.883900e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372323e-02 +- 1.131684e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.194264e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000223 sec
0b MemAlloc :     0.027469 sec
0c GenCreat :     0.000820 sec
0d SGoodHel :     0.000057 sec
1a GenSeed  :     0.000010 sec
1b GenRnGen :     0.027609 sec
2a RamboIni :     0.005999 sec
2b RamboFin :     0.091005 sec
3a SigmaKin :     0.530446 sec
4a DumpLoop :     0.004305 sec
8a CompStat :     0.003523 sec
9a GenDestr :     0.000124 sec
9b DumpScrn :     0.000173 sec
9c DumpJson :     0.000009 sec
TOTAL       :     0.691771 sec
TOTAL (123) :     0.655068 sec
TOTAL  (23) :     0.627450 sec
TOTAL   (1) :     0.027618 sec
TOTAL   (2) :     0.097003 sec
TOTAL   (3) :     0.530446 sec
***********************************************************************
…x lower than AVX2.

objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm'  | wc -l
0
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm|xmm'  | wc -l
2487

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[1] == AOS
Momenta memory layout      = AOSOA[1] == AOS
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.111078e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.083266e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.781247e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.737131e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 9.858946e-01                 )  sec
MeanTimeInMatrixElems      = ( 9.858946e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 9.858946e-01 ,  9.858946e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 4.718731e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 4.839883e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 5.317891e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372323e-02 +- 1.131684e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.194264e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000263 sec
0b MemAlloc :     0.026867 sec
0c GenCreat :     0.000882 sec
0d SGoodHel :     0.000096 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027803 sec
2a RamboIni :     0.006124 sec
2b RamboFin :     0.091247 sec
3a SigmaKin :     0.985895 sec
4a DumpLoop :     0.004363 sec
8a CompStat :     0.003518 sec
9a GenDestr :     0.000074 sec
9b DumpScrn :     0.000162 sec
9c DumpJson :     0.000008 sec
TOTAL       :     1.147311 sec
TOTAL (123) :     1.111078 sec
TOTAL  (23) :     1.083266 sec
TOTAL   (1) :     0.027812 sec
TOTAL   (2) :     0.097371 sec
TOTAL   (3) :     0.985895 sec
***********************************************************************
(I guess the 1.8 vs 2.0 fluctuation is the usual VM hypervisor issue?)

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 4.073515e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.797491e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.760239e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.891759e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.808315e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.808315e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.808315e-01 ,  2.808315e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.287065e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.380617e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.866913e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000276 sec
0b MemAlloc :     0.026869 sec
0c GenCreat :     0.000815 sec
0d SGoodHel :     0.000049 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027594 sec
2a RamboIni :     0.006764 sec
2b RamboFin :     0.092153 sec
3a SigmaKin :     0.280831 sec
4a DumpLoop :     0.004415 sec
8a CompStat :     0.003583 sec
9a GenDestr :     0.000073 sec
9b DumpScrn :     0.000180 sec
9c DumpJson :     0.000007 sec
TOTAL       :     0.443619 sec
TOTAL (123) :     0.407351 sec
TOTAL  (23) :     0.379749 sec
TOTAL   (1) :     0.027602 sec
TOTAL   (2) :     0.098918 sec
TOTAL   (3) :     0.280831 sec
***********************************************************************
[Note: I get 1.88 while 567308f was 2.08: real effect or VM fluctuation?]

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 4.073515e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.797491e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.760239e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.891759e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.808315e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.808315e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.808315e-01 ,  2.808315e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.287065e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.380617e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.866913e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000276 sec
0b MemAlloc :     0.026869 sec
0c GenCreat :     0.000815 sec
0d SGoodHel :     0.000049 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027594 sec
2a RamboIni :     0.006764 sec
2b RamboFin :     0.092153 sec
3a SigmaKin :     0.280831 sec
4a DumpLoop :     0.004415 sec
8a CompStat :     0.003583 sec
9a GenDestr :     0.000073 sec
9b DumpScrn :     0.000180 sec
9c DumpJson :     0.000007 sec
TOTAL       :     0.443619 sec
TOTAL (123) :     0.407351 sec
TOTAL  (23) :     0.379749 sec
TOTAL   (1) :     0.027602 sec
TOTAL   (2) :     0.098918 sec
TOTAL   (3) :     0.280831 sec
***********************************************************************
…8 instead of 1.88

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.696970e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.420472e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.764989e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.099844e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.510487e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.510487e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.510487e-01 ,  2.510487e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.418156e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.532794e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.088391e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.002384 sec
0b MemAlloc :     0.027189 sec
0c GenCreat :     0.000924 sec
0d SGoodHel :     0.000039 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027641 sec
2a RamboIni :     0.006722 sec
2b RamboFin :     0.084276 sec
3a SigmaKin :     0.251049 sec
4a DumpLoop :     0.004665 sec
8a CompStat :     0.003574 sec
9a GenDestr :     0.000079 sec
9b DumpScrn :     0.000170 sec
9c DumpJson :     0.000008 sec
TOTAL       :     0.408729 sec
TOTAL (123) :     0.369697 sec
TOTAL  (23) :     0.342047 sec
TOTAL   (1) :     0.027650 sec
TOTAL   (2) :     0.090998 sec
TOTAL   (3) :     0.251049 sec
***********************************************************************
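
One way to judge "real effect or VM fluctuation" is to compare a difference against the spread seen for nominally the same configuration. The sketch below uses the three AOSOA[4] matrix element throughputs printed in this thread; they come from different commits, so this is only a rough indication and not a clean repeated measurement:

```python
import statistics


def relative_spread(samples):
    """(max - min) / mean of a list of measurements."""
    return (max(samples) - min(samples)) / statistics.mean(samples)


def relative_gap(a, b):
    """Relative difference between two measurements, w.r.t. the larger."""
    return abs(a - b) / max(a, b)


# EvtsPerSec[MatrixElems] for AOSOA[4], copied from the logs above.
aosoa4_rates = [2.080523e06, 1.866913e06, 2.088391e06]
spread = relative_spread(aosoa4_rates)

# AOSOA[8] (2.006091e+06) versus AOSOA[4] (2.080523e+06).
gap = relative_gap(2.006091e06, 2.080523e06)

print(f"AOSOA[4] spread: {spread:.1%}, AOSOA[8] vs AOSOA[4] gap: {gap:.1%}")
```

With single runs, the gap between the AOSOA[8] and AOSOA[4] builds is smaller than the spread among the AOSOA[4] results themselves, so several repetitions per build would be needed before reading anything into it.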
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[1] == AOS
Momenta memory layout      = AOSOA[1] == AOS
Internal loops fptype_sv   = SCALAR (no SIMD)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.098961e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 1.070591e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.837009e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.763392e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 9.729566e-01                 )  sec
MeanTimeInMatrixElems      = ( 9.729566e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 9.729566e-01 ,  9.729566e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 4.770762e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 4.897185e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 5.388606e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372323e-02 +- 1.131684e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.194264e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[8]
Momenta memory layout      = AOSOA[8]
Internal loops fptype_sv   = VECTOR[8] (AVX512F)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.827660e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.550714e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.769458e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.366414e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.614073e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.614073e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.614073e-01 ,  2.614073e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.369735e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.476571e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.005637e+06                 )  sec^-1
***********************************************************************
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[8]
Momenta memory layout      = AOSOA[8]
Internal loops fptype_sv   = VECTOR[8] (AVX512F)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.831118e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.554309e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.768089e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.398625e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.614447e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.614447e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.614447e-01 ,  2.614447e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.368499e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.475077e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.005350e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372469e-02 +- 1.132952e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374903e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.203450e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
…or-width=512)

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[8]
Momenta memory layout      = AOSOA[8]
Internal loops fptype_sv   = VECTOR[8] (AVX512F)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.753826e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.477140e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.766860e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.360017e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.541139e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.541139e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.541139e-01 ,  2.541139e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.396676e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.507814e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.063201e+06                 )  sec^-1
***********************************************************************
…an AVX512F)

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Internal loops fptype_sv   = VECTOR[4] (AVX2)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.731868e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.455043e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.768257e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.354625e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.519580e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.519580e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.519580e-01 ,  2.519580e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.404894e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.517457e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.080855e+06                 )  sec^-1
***********************************************************************
…ster?...)

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = FLOAT (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[1] == AOS
Momenta memory layout      = AOSOA[1] == AOS
Internal loops fptype_sv   = SCALAR (no SIMD)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 2.974846e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 2.946718e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.812788e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.032891e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.843429e+00                 )  sec
MeanTimeInMatrixElems      = ( 2.843429e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.843429e+00 ,  2.843429e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.762404e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.779227e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.843858e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372328e-02 +- 1.131740e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 5.445383e-03 ,  7.884562e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.194670e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
…t multiplied by float vector)

Results are OK in no-SIMD float. The matrix-element throughput is 1.84E5 events/sec (lower than in no-SIMD double?!)

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = FLOAT (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[1] == AOS
Momenta memory layout      = AOSOA[1] == AOS
Internal loops fptype_sv   = SCALAR (no SIMD)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 2.973374e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 2.945622e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.775213e-02                 )  sec
TotalTime[Rambo]        (2)= ( 1.001029e-01                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.845519e+00                 )  sec
MeanTimeInMatrixElems      = ( 2.845519e+00                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.845519e+00 ,  2.845519e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.763276e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.779889e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.842504e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372328e-02 +- 1.131740e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 5.445383e-03 ,  7.884562e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.194670e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = FLOAT (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[8]
Momenta memory layout      = AOSOA[8]
Internal loops fptype_sv   = VECTOR[8] (AVX2)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 2.929596e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 2.650706e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.788902e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.624017e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.688304e-01                 )  sec
MeanTimeInMatrixElems      = ( 1.688304e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.688304e-01 ,  1.688304e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.789626e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.977919e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 3.105412e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372471e-02 +- 1.132959e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 5.997265e-03 ,  3.883831e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.203501e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000295 sec
0b MemAlloc :     0.015355 sec
0c GenCreat :     0.000865 sec
0d SGoodHel :     0.000040 sec
1a GenSeed  :     0.000009 sec
1b GenRnGen :     0.027880 sec
2a RamboIni :     0.005198 sec
2b RamboFin :     0.091043 sec
3a SigmaKin :     0.168830 sec
4a DumpLoop :     0.003573 sec
8a CompStat :     0.003527 sec
9a GenDestr :     0.000074 sec
9b DumpScrn :     0.000151 sec
9c DumpJson :     0.000008 sec
TOTAL       :     0.316847 sec
TOTAL (123) :     0.292960 sec
TOTAL  (23) :     0.265071 sec
TOTAL   (1) :     0.027889 sec
TOTAL   (2) :     0.096240 sec
TOTAL   (3) :     0.168830 sec
***********************************************************************
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = FLOAT (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[16]
Momenta memory layout      = AOSOA[16]
Internal loops fptype_sv   = VECTOR[16] (AVX512F)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 1.230690e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 9.521992e-02                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.784907e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.479249e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 4.274280e-04                 )  sec
MeanTimeInMatrixElems      = ( 4.274280e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 4.274280e-04 ,  4.274280e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 4.260115e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 5.506075e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.226611e+09                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 0.000000e+00 +- 0.000000e+00 )  GeV^0
[Min,Max]MatrixElemValue   = [ 0.000000e+00 ,  0.000000e+00 ]  GeV^0
StdDevMatrixElemValue      = ( 0.000000e+00                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
Results are OK in float AVX512, but the matrix-element throughput is not higher than in float AVX2 (2.99E6 vs 3.11E6 events/sec).

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = FLOAT (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[16]
Momenta memory layout      = AOSOA[16]
Internal loops fptype_sv   = VECTOR[16] (AVX512F)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 2.983189e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 2.705257e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.779320e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.526685e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 1.752588e-01                 )  sec
MeanTimeInMatrixElems      = ( 1.752588e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 1.752588e-01 ,  1.752588e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.757475e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.938034e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.991507e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.372028e-02 +- 1.132663e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.032828e-03 ,  3.488652e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.201354e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Internal loops fptype_sv   = VECTOR[4] (AVX2)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.702728e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.425338e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.773900e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.132473e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.512090e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.512090e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.512090e-01 ,  2.512090e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.415951e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.530617e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.087059e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[8]
Momenta memory layout      = AOSOA[8]
Internal loops fptype_sv   = VECTOR[8] (AVX512F)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.759226e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.481836e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.773900e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.471789e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.534657e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.534657e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.534657e-01 ,  2.534657e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.394670e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.505780e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.068477e+06                 )  sec^-1
***********************************************************************
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm'  | wc -l
1239

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Internal loops fptype_sv   = VECTOR[4] (AVX2)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.714090e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.437035e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.770548e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.189213e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.518114e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.518114e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.518114e-01 ,  2.518114e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.411619e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.525408e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.082067e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************

./gcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 7.326907e-03                 )  sec
TotalTime[Rambo+ME]    (23)= ( 6.668723e-03                 )  sec
TotalTime[RndNumGen]    (1)= ( 6.581840e-04                 )  sec
TotalTime[Rambo]        (2)= ( 5.870857e-03                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.978660e-04                 )  sec
MeanTimeInMatrixElems      = ( 7.978660e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.978660e-04 ,  7.978660e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.155652e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 7.861895e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.571128e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm'  | wc -l
1246

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid           = 16384
NumThreadsPerBlock         = 32
NumIterations              = 1
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Internal loops fptype_sv   = VECTOR[4] (AVX2)
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.720043e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 3.441720e-01                 )  sec
TotalTime[RndNumGen]    (1)= ( 2.783231e-02                 )  sec
TotalTime[Rambo]        (2)= ( 9.215697e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 2.520150e-01                 )  sec
MeanTimeInMatrixElems      = ( 2.520150e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 2.520150e-01 ,  2.520150e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.409360e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 1.523332e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.080384e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 524288
MeanMatrixElemValue        = ( 1.371958e-02 +- 1.132119e-05 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071582e-03 ,  3.374915e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.197419e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
…ype_v:

g++  -O3 -std=c++11 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra  -march=core-avx2  -I/usr/local/cuda-11.1/include/ -c check.cc -o check.o
check.cc: In function ‘std::unique_ptr<T []> hstMakeUnique(std::size_t) [with T = __vector(4) double; std::size_t = long unsigned int]’:
check.cc:82:118: warning: ‘new’ of type ‘mgOnGpu::fptype_v’ {aka ‘__vector(4) double’} with extended alignment 32 [-Waligned-new=]
 _ptr<fptype_v[]> hstMakeUnique(std::size_t N) { return std::unique_ptr<fptype_v[]>{ new fptype_v[N/neppV]() }; };
                                                                                                           ^
check.cc:82:118: note: uses ‘void* operator new [](std::size_t)’, which does not have an alignment parameter
check.cc:82:118: note: use ‘-faligned-new’ to enable C++17 over-aligned new support
valassi added 27 commits April 14, 2021 09:32
…do it later

Now revert this, complete some xxx things first, and later do this move
Fix all conflicts. All tests pass in epoch1 (avx512 and none) and epoch2.
Fix conflicts in epoch1 testxxx.cc and CPPProcess.cc
Fix conflicts epoch1 CPPProcess.cc

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[1] == SCALAR (no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.312772e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.154965 sec
real    0m7.164s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 7.292815e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.955000 sec
real    0m1.246s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[4] (AVX512F)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.715700e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.620541 sec
real    0m3.630s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 7.299221e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.168761 sec
real    0m1.465s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.131055e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.925395 sec
real    0m7.935s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 7.399876e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.915813 sec
real    0m1.208s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/CPPProcess.cc

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[1] == SCALAR (no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.311938e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.160762 sec
real    0m7.170s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 7.165268e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.740539 sec
real    0m1.032s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[4] (AVX512F)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.716124e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.622706 sec
real    0m3.633s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 7.270777e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.774117 sec
real    0m1.067s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.138079e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.891635 sec
real    0m7.901s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 7.277430e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.966146 sec
real    0m1.258s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/CPPProcess.cc

Baseline performance after the last(?) merge of testxxx.
CUDA in epoch1 is always around 1-2% slower than in epoch2, due to the SIMD changes in C++.

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[1] == SCALAR (no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.317036e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.141695 sec
real    0m7.152s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 7.308881e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.727713 sec
real    0m1.018s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[4] (AVX512F)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.883989e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.597827 sec
real    0m3.607s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 7.252899e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.731646 sec
real    0m1.022s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147018e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.843119 sec
real    0m7.852s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 7.399451e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.730010 sec
real    0m1.023s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
…g bug fix?)

If MGONGPU_CPPSIMD is not defined, then __AVX2__ and similar macros must be ignored!
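
The rule above can be sketched with preprocessor guards. This is only an illustration and not the project's actual header: `MGONGPU_CPPSIMD` and `__AVX2__` come from the commit message, while `sketch_neppV`, `sketch_isSimd` and the width mapping are hypothetical names and values chosen for the example.

```cpp
#include <cstddef>

// Hypothetical sketch, not the project's real header: MGONGPU_CPPSIMD is the
// opt-in switch named in the commit message; sketch_neppV is an invented name.
#ifdef MGONGPU_CPPSIMD
// SIMD was explicitly requested: only now may the compiler's ISA macros
// (defined by flags such as -mavx2 or -march=native) choose the vector width.
#if defined( __AVX512F__ )
constexpr std::size_t sketch_neppV = 8; // 512-bit registers: 8 doubles
#elif defined( __AVX2__ )
constexpr std::size_t sketch_neppV = 4; // 256-bit registers: 4 doubles
#elif defined( __SSE4_2__ )
constexpr std::size_t sketch_neppV = 2; // 128-bit registers: 2 doubles
#else
constexpr std::size_t sketch_neppV = 1; // no known SIMD ISA: fall back to scalar
#endif
#else
// SIMD was not requested: ignore __AVX2__ and similar macros entirely, even if
// the build flags happen to define them, and stay scalar.
constexpr std::size_t sketch_neppV = 1;
#endif

// True only when the build both asked for SIMD and got a width above 1.
constexpr bool sketch_isSimd = ( sketch_neppV > 1 );
```

With this layout a build that passes `-mavx2` but does not define `MGONGPU_CPPSIMD` still gets a width of 1. Choosing 256-bit vectors on an AVX512 machine would need one more switch, which is not shown here.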
…mdSymSummary.sh

for avx in none sse4 avx2 512y 512z; do f=./build.$avx/CPPProcess.o; ./simdSymSummary.sh $f; done
=== Symbols in ./build.none/CPPProcess.o === (~sse4:  611) (avx2:    0) (512y:    0) (512z:    0)
=== Symbols in ./build.sse4/CPPProcess.o === (~sse4: 3278) (avx2:    0) (512y:    0) (512z:    0)
=== Symbols in ./build.avx2/CPPProcess.o === (~sse4:    0) (avx2: 2766) (512y:    0) (512z:    0)
=== Symbols in ./build.512y/CPPProcess.o === (~sse4:    0) (avx2: 2597) (512y:   94) (512z:    0)
=== Symbols in ./build.512z/CPPProcess.o === (~sse4:    0) (avx2: 1182) (512y:  208) (512z: 2035)

for avx in none sse4 avx2 512y 512z; do f=./build.$avx/CPPProcess.o; ./simdSymSummary.sh -helamps $f; done
=== Symbols in ./build.none/CPPProcess.o === (~sse4:  536) (avx2:    0) (512y:    0) (512z:    0)
=== Symbols in ./build.sse4/CPPProcess.o === (~sse4: 3145) (avx2:    0) (512y:    0) (512z:    0)
=== Symbols in ./build.avx2/CPPProcess.o === (~sse4:    0) (avx2: 2500) (512y:    0) (512z:    0)
=== Symbols in ./build.512y/CPPProcess.o === (~sse4:    0) (avx2: 2369) (512y:   77) (512z:    0)
=== Symbols in ./build.512z/CPPProcess.o === (~sse4:    0) (avx2:  979) (512y:  189) (512z: 2015)
The idea is to distinguish between AVX512 on xmm/ymm registers ('512y') and AVX512 on zmm registers ('512z').
The former ("AVX256" in LHCb) is the fastest option.
The latter is slower, probably because it causes a clock slowdown (to be demonstrated).

Note in any case that '512z' seems to have many more instructions in total, 3425 against 2691 for '512y' (why?):
=== Symbols in ./build.512y/CPPProcess.o === (~sse4:    0) (avx2: 2597) (512y:   94) (512z:    0)
=== Symbols in ./build.512z/CPPProcess.o === (~sse4:    0) (avx2: 1182) (512y:  208) (512z: 2035)
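
The xmm/ymm versus zmm distinction can be approximated by looking at register names in the disassembly. The helper below is a hypothetical sketch and does not reproduce the logic of simdSymSummary.sh: `SketchReg` and `sketch_widestRegister` are invented names, and the input is assumed to be one line of `objdump -d` output.

```cpp
#include <string>

// Widest SIMD register class mentioned on one disassembly line.
enum class SketchReg { None, Xmm, Ymm, Zmm };

// Hypothetical helper (an assumption, not the script's real implementation):
// classify a line of `objdump -d` output by the widest register name in it.
inline SketchReg sketch_widestRegister( const std::string& line )
{
  if( line.find( "zmm" ) != std::string::npos ) return SketchReg::Zmm; // 512-bit
  if( line.find( "ymm" ) != std::string::npos ) return SketchReg::Ymm; // 256-bit
  if( line.find( "xmm" ) != std::string::npos ) return SketchReg::Xmm; // 128-bit
  return SketchReg::None; // no SIMD register on this line
}
```

A register-name test alone cannot separate plain AVX2 from AVX512 instructions that operate on ymm registers, since both use the same register names. Telling those apart needs extra information such as the mnemonic, the use of ymm16 to ymm31, or mask registers.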
Current baseline performance on all avx modes

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[1] ('none': scalar, no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.313578e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.158000 sec
real    0m7.168s
=== Symbols in CPPProcess.o === (~sse4:  611) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.467718e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.769639 sec
real    0m1.073s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.479539e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.896227 sec
real    0m4.906s
=== Symbols in CPPProcess.o === (~sse4: 3278) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.490480e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.695938 sec
real    0m3.706s
=== Symbols in CPPProcess.o === (~sse4:    0) (avx2: 2766) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.912842e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.571775 sec
real    0m3.581s
=== Symbols in CPPProcess.o === (~sse4:    0) (avx2: 2597) (512y:   94) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.702956e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.014852 sec
real    0m4.024s
=== Symbols in CPPProcess.o === (~sse4:    0) (avx2: 1182) (512y:  208) (512z: 2035)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147804e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.834120 sec
real    0m7.843s
=== Symbols in CPPProcess.o === (~sse4:  561) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.520156e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.763560 sec
real    0m1.068s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
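
To compare the avx modes in the log above more directly, the throughputs can be turned into speedups over the scalar 'none' build. The sketch below copies the EPOCH1_EEMUMU_CPP `EvtsPerSec[MatrixElems]` values from that log; the struct and function names are hypothetical and only serve the example.

```cpp
#include <cstddef>

// One EPOCH1_EEMUMU_CPP entry from the log above (names are hypothetical).
struct SketchResult
{
  const char* mode;   // build mode tag as printed in the log
  std::size_t width;  // doubles per vector, the n in VECTOR[n]
  double evtsPerSec;  // EvtsPerSec[MatrixElems]
};

// Values copied from the "Current baseline performance on all avx modes" log.
constexpr SketchResult sketch_results[] = {
  { "none", 1, 1.313578e+06 },
  { "sse4", 2, 2.479539e+06 },
  { "avx2", 4, 4.490480e+06 },
  { "512y", 4, 4.912842e+06 },
  { "512z", 8, 3.702956e+06 },
};

// Speedup of a build relative to the scalar 'none' build (first entry).
constexpr double sketch_speedup( const SketchResult& r )
{
  return r.evtsPerSec / sketch_results[0].evtsPerSec;
}

// Fraction of the ideal speedup (equal to the vector width) actually achieved.
constexpr double sketch_efficiency( const SketchResult& r )
{
  return sketch_speedup( r ) / static_cast<double>( r.width );
}
```

With these numbers '512y' is roughly 3.7x faster than scalar and 'avx2' roughly 3.4x, while '512z' reaches only about 2.8x despite having twice the vector width.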
Fix conflicts: epoch1 CPPProcess.cc and throughput12.sh

Baseline performance after the merge:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[1] ('none': scalar, no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.307178e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.202903 sec
real    0m7.213s
=== Symbols in CPPProcess.o === (~sse4:  617) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.196734e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.748795 sec
real    0m1.051s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.515310e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.851119 sec
real    0m4.861s
=== Symbols in CPPProcess.o === (~sse4: 3264) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.451217e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.709405 sec
real    0m3.719s
=== Symbols in CPPProcess.o === (~sse4:    0) (avx2: 2770) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.842292e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.604947 sec
real    0m3.615s
=== Symbols in CPPProcess.o === (~sse4:    0) (avx2: 2600) (512y:   94) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.727711e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.005770 sec
real    0m4.016s
=== Symbols in CPPProcess.o === (~sse4:    0) (avx2: 1184) (512y:  208) (512z: 2040)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.146197e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.847846 sec
real    0m7.860s
=== Symbols in CPPProcess.o === (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.293983e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.747671 sec
real    0m1.050s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Fix various conflicts.

Baseline double performance
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[1] ('none': scalar, no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.305953e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.184897 sec
real    0m7.195s
=== Symbols in CPPProcess.o === (~sse4:  617) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 6.997065e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.906774 sec
real    0m1.209s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.512330e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.854177 sec
real    0m4.864s
=== Symbols in CPPProcess.o === (~sse4: 3264) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.446023e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.712585 sec
real    0m3.722s
=== Symbols in CPPProcess.o === (~sse4:    0) (avx2: 2770) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.799523e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.602933 sec
real    0m3.613s
=== Symbols in CPPProcess.o === (~sse4:    0) (avx2: 2600) (512y:   94) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.718682e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.009003 sec
real    0m4.018s
=== Symbols in CPPProcess.o === (~sse4:    0) (avx2: 1184) (512y:  208) (512z: 2040)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147329e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.841047 sec
real    0m7.851s
=== Symbols in CPPProcess.o === (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.151338e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.878278 sec
real    0m1.170s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
@valassi
Member Author

valassi commented Apr 23, 2021

For some reason that I do not understand (I did a force push at one point, but then things seemed ok), GitHub is now saying there are conflicts. I found it easier to recreate an identical branch and resubmit it as PR #171, which replaces this #152. Closing.

@valassi valassi closed this Apr 23, 2021