klas2 (SIMD CPU) + epoch1/epoch2 #152
Closed
…2 build flag. It is using zmm registers a lot, but no sign of speedup (I am on a Skylake) in AVX512 vs AVX2.
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm|zmm' | wc -l
1356
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l
452
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[8]
Momenta memory layout = AOSOA[8]
Random number generation = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.819619e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 3.544286e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.753339e-02 ) sec
TotalTime[Rambo] (2)= ( 9.308055e-02 ) sec
TotalTime[MatrixElems] (3)= ( 2.613480e-01 ) sec
MeanTimeInMatrixElems = ( 2.613480e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 2.613480e-01 , 2.613480e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.372618e+06 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 1.479249e+06 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.006091e+06 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 524288
MeanMatrixElemValue = ( 1.372469e-02 +- 1.132952e-05 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374903e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.203450e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
0a ProcInit : 0.000263 sec
0b MemAlloc : 0.027018 sec
0c GenCreat : 0.000858 sec
0d SGoodHel : 0.000145 sec
1a GenSeed : 0.000009 sec
1b GenRnGen : 0.027525 sec
2a RamboIni : 0.007915 sec
2b RamboFin : 0.085166 sec
3a SigmaKin : 0.261348 sec
4a DumpLoop : 0.004633 sec
8a CompStat : 0.003537 sec
9a GenDestr : 0.000071 sec
9b DumpScrn : 0.000177 sec
9c DumpJson : 0.000008 sec
TOTAL : 0.418671 sec
TOTAL (123) : 0.381962 sec
TOTAL (23) : 0.354429 sec
TOTAL (1) : 0.027533 sec
TOTAL (2) : 0.093081 sec
TOTAL (3) : 0.261348 sec
***********************************************************************
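For anyone cross-checking these summaries: the reported EvtsPerSec[MatrixElems] should simply be TotalEventsComputed divided by TotalTime[MatrixElems], and TotalEventsComputed should be blocks x threads x iterations (16384 x 32 x 1 = 524288). Below is a minimal Python sketch that pulls those fields out of a flattened log string and recomputes the throughput. The function name and the regular expressions are my own illustration, not part of check.exe; the sample string only copies the three relevant fields from the AOSOA[8] log above.

```python
import re

# "key (3)= ( value )" pairs as printed in the timing summary.
_PAIR = re.compile(
    r"(TotalTime\[MatrixElems\]|EvtsPerSec\[MatrixElems\])"
    r"\s*\(3\)=\s*\(\s*([0-9.eE+-]+)\s*\)"
)
_EVENTS = re.compile(r"TotalEventsComputed\s*=\s*(\d+)")


def matrix_element_throughput(log_text):
    """Return (recomputed, reported) matrix-element throughput in events/sec."""
    values = {key: float(val) for key, val in _PAIR.findall(log_text)}
    n_events = int(_EVENTS.search(log_text).group(1))
    recomputed = n_events / values["TotalTime[MatrixElems]"]
    return recomputed, values["EvtsPerSec[MatrixElems]"]


# Three fields copied from the AOSOA[8] log above.
sample = (
    "TotalTime[MatrixElems] (3)= ( 2.613480e-01 ) sec "
    "TotalEventsComputed = 524288 "
    "EvtsPerSec[MatrixElems] (3)= ( 2.006091e+06 ) sec^-1"
)
recomputed, reported = matrix_element_throughput(sample)
print(f"recomputed={recomputed:.6e} reported={reported:.6e}")
```

The two numbers agree to the printed precision, so the throughput line carries no information beyond the SigmaKin time and the event count.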
…numbers as GPU. objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm|zmm' | wc -l 1247 objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l 1247 ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 3.708773e-01 ) sec TotalTime[Rambo+ME] (23)= ( 3.432945e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.758284e-02 ) sec TotalTime[Rambo] (2)= ( 9.129624e-02 ) sec TotalTime[MatrixElems] (3)= ( 2.519982e-01 ) sec MeanTimeInMatrixElems = ( 2.519982e-01 ) sec [Min,Max]TimeInMatrixElems = [ 2.519982e-01 , 2.519982e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.413643e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.527225e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 2.080523e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000286 sec 0b MemAlloc : 0.026870 sec 0c GenCreat : 0.000814 sec 0d SGoodHel : 0.000046 sec 1a GenSeed : 0.000008 sec 1b GenRnGen : 0.027575 sec 2a RamboIni : 0.006751 sec 2b RamboFin : 0.084545 sec 3a SigmaKin : 0.251998 sec 4a 
DumpLoop : 0.004530 sec 8a CompStat : 0.003526 sec 9a GenDestr : 0.000116 sec 9b DumpScrn : 0.000167 sec 9c DumpJson : 0.000008 sec TOTAL : 0.407241 sec TOTAL (123) : 0.370877 sec TOTAL (23) : 0.343294 sec TOTAL (1) : 0.027583 sec TOTAL (2) : 0.091296 sec TOTAL (3) : 0.251998 sec *********************************************************************** ./gcheck.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = THRUST::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Wavefunction GPU memory = LOCAL Random number generation = CURAND DEVICE (CUDA code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 7.136103e-03 ) sec TotalTime[Rambo+ME] (23)= ( 6.523176e-03 ) sec TotalTime[RndNumGen] (1)= ( 6.129270e-04 ) sec TotalTime[Rambo] (2)= ( 5.734249e-03 ) sec TotalTime[MatrixElems] (3)= ( 7.889270e-04 ) sec MeanTimeInMatrixElems = ( 7.889270e-04 ) sec [Min,Max]TimeInMatrixElems = [ 7.889270e-04 , 7.889270e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.346979e+07 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 8.037312e+07 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 6.645583e+08 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 00 CudaFree : 1.067698 
sec 0a ProcInit : 0.000329 sec 0b MemAlloc : 0.035648 sec 0c GenCreat : 0.010443 sec 0d SGoodHel : 0.001837 sec 1a GenSeed : 0.000011 sec 1b GenRnGen : 0.000602 sec 2a RamboIni : 0.000018 sec 2b RamboFin : 0.000011 sec 2c CpDTHwgt : 0.000506 sec 2d CpDTHmom : 0.005198 sec 3a SigmaKin : 0.000014 sec 3b CpDTHmes : 0.000775 sec 4a DumpLoop : 0.004293 sec 8a CompStat : 0.003611 sec 9a GenDestr : 0.000051 sec 9b DumpScrn : 0.000157 sec 9c DumpJson : 0.000007 sec TOTAL : 1.131209 sec TOTAL (123) : 0.007136 sec TOTAL (23) : 0.006523 sec TOTAL (1) : 0.000613 sec TOTAL (2) : 0.005734 sec TOTAL (3) : 0.000789 sec ***********************************************************************
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l 0 objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm|xmm' | wc -l 2932 ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[2] Momenta memory layout = AOSOA[2] Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 3.923828e-01 ) sec TotalTime[Rambo+ME] (23)= ( 3.646642e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.771861e-02 ) sec TotalTime[Rambo] (2)= ( 9.897543e-02 ) sec TotalTime[MatrixElems] (3)= ( 2.656888e-01 ) sec MeanTimeInMatrixElems = ( 2.656888e-01 ) sec [Min,Max]TimeInMatrixElems = [ 2.656888e-01 , 2.656888e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.336164e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.437728e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 1.973316e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372411e-02 +- 1.132746e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374897e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.201954e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000258 sec 0b MemAlloc : 0.027183 sec 0c GenCreat : 0.000877 sec 0d SGoodHel : 0.000036 sec 1a GenSeed : 0.000011 sec 1b GenRnGen : 0.027708 sec 2a RamboIni : 0.006584 sec 2b RamboFin : 0.092392 sec 3a SigmaKin : 0.265689 sec 4a DumpLoop : 0.004384 sec 
8a CompStat : 0.003518 sec 9a GenDestr : 0.000080 sec 9b DumpScrn : 0.000224 sec 9c DumpJson : 0.000008 sec TOTAL : 0.428950 sec TOTAL (123) : 0.392383 sec TOTAL (23) : 0.364664 sec TOTAL (1) : 0.027719 sec TOTAL (2) : 0.098975 sec TOTAL (3) : 0.265689 sec ***********************************************************************
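A note on reading the `objdump | egrep | wc -l` counts used in these commit messages: the pattern mixes mnemonics (vaddpd, vmul, vfmadd132pd) with register names, so a line such as a vaddpd on zmm registers is counted both with and without `zmm` in the pattern. The difference between the two counts is therefore only a lower bound on the number of lines touching zmm. The Python sketch below reproduces the line counting on a tiny hand-written sample; the sample lines are hypothetical and are not real objdump output from check.exe.

```python
import re


def count_matching_lines(disassembly, pattern):
    """Count lines that match pattern, like `egrep pattern | wc -l`."""
    regex = re.compile(pattern)
    return sum(1 for line in disassembly.splitlines() if regex.search(line))


# Hypothetical, simplified disassembly lines (no addresses or opcode bytes).
sample = """\
vaddpd %zmm1,%zmm0,%zmm0
vmovapd (%rdi),%zmm2
vfmadd132pd %ymm3,%ymm4,%ymm5
vmovupd %xmm0,(%rsi)
mov %rax,%rbx
"""

with_zmm = count_matching_lines(sample, r"vaddpd|vmul|vfmadd132pd|ymm|zmm")
without_zmm = count_matching_lines(sample, r"vaddpd|vmul|vfmadd132pd|ymm")
zmm_only = count_matching_lines(sample, r"zmm")

print(with_zmm, without_zmm, zmm_only)
```

Here the difference of the first two counts is 1 while two lines actually use zmm, which is why a separate count on the register name alone is the cleaner check.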
…ughput down a factor 3. Note that throughput is still better than one week ago, why? objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l 0 objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm|xmm' | wc -l 2908 ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[1] == AOS Momenta memory layout = AOSOA[1] == AOS Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 7.240392e-01 ) sec TotalTime[Rambo+ME] (23)= ( 6.965008e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.753842e-02 ) sec TotalTime[Rambo] (2)= ( 9.721096e-02 ) sec TotalTime[MatrixElems] (3)= ( 5.992898e-01 ) sec MeanTimeInMatrixElems = ( 5.992898e-01 ) sec [Min,Max]TimeInMatrixElems = [ 5.992898e-01 , 5.992898e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.241155e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 7.527457e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 8.748488e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372323e-02 +- 1.131684e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.194264e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000219 sec 0b MemAlloc : 0.026940 sec 0c GenCreat : 0.000870 sec 0d SGoodHel : 0.000075 sec 1a GenSeed : 0.000009 sec 1b GenRnGen : 0.027529 sec 2a 
RamboIni : 0.005977 sec 2b RamboFin : 0.091234 sec 3a SigmaKin : 0.599290 sec 4a DumpLoop : 0.004344 sec 8a CompStat : 0.003567 sec 9a GenDestr : 0.000077 sec 9b DumpScrn : 0.000177 sec 9c DumpJson : 0.000007 sec TOTAL : 0.760314 sec TOTAL (123) : 0.724039 sec TOTAL (23) : 0.696501 sec TOTAL (1) : 0.027538 sec TOTAL (2) : 0.097211 sec TOTAL (3) : 0.599290 sec ***********************************************************************
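The "AOSOA[N]" in the memory layout lines is an array of structures of arrays with N events per page, and "AOSOA[1] == AOS" is the degenerate case. As a rough illustration only, here is a Python sketch of how a flat index could be computed for such a layout. It assumes the ordering [page][particle][component][event-in-page]; that ordering and all names (`aosoa_index`, `npar`, `np4`, `nepp`) are assumptions for this sketch, not taken from the code in this PR.

```python
def aosoa_index(ievt, ipar, ip4, npar, np4, nepp):
    """Flat index for an assumed [page][particle][component][event-in-page] layout.

    ievt: event number, ipar: particle, ip4: momentum component,
    nepp: events per page (the N in AOSOA[N]).
    """
    ipag, iepp = divmod(ievt, nepp)  # page number and position inside the page
    return ((ipag * npar + ipar) * np4 + ip4) * nepp + iepp


# Hypothetical sizes: 4 particles with 4 momentum components each.
NPAR, NP4 = 4, 4

# With nepp=4, the same component of consecutive events in a page is contiguous,
# which is what lets one vector load fill all SIMD lanes.
i4 = aosoa_index(4, 1, 2, NPAR, NP4, nepp=4)
i5 = aosoa_index(5, 1, 2, NPAR, NP4, nepp=4)

# With nepp=1 the formula reduces to plain AOS indexing.
aos = aosoa_index(5, 1, 2, NPAR, NP4, nepp=1)
print(i4, i5, aos)
```

In the AOS case the same component of neighbouring events sits `npar * np4` elements apart, so the compiler cannot load it into one vector register without a gather.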
…changes) Actually faster now?? ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[1] == AOS Momenta memory layout = AOSOA[1] == AOS Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 6.550682e-01 ) sec TotalTime[Rambo+ME] (23)= ( 6.274499e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.761832e-02 ) sec TotalTime[Rambo] (2)= ( 9.700339e-02 ) sec TotalTime[MatrixElems] (3)= ( 5.304465e-01 ) sec MeanTimeInMatrixElems = ( 5.304465e-01 ) sec [Min,Max]TimeInMatrixElems = [ 5.304465e-01 , 5.304465e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 8.003564e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 8.355855e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 9.883900e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372323e-02 +- 1.131684e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.194264e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000223 sec 0b MemAlloc : 0.027469 sec 0c GenCreat : 0.000820 sec 0d SGoodHel : 0.000057 sec 1a GenSeed : 0.000010 sec 1b GenRnGen : 0.027609 sec 2a RamboIni : 0.005999 sec 2b RamboFin : 0.091005 sec 3a SigmaKin : 0.530446 sec 4a DumpLoop : 0.004305 sec 8a CompStat : 0.003523 sec 9a GenDestr : 0.000124 sec 9b DumpScrn : 0.000173 sec 9c DumpJson : 0.000009 
sec TOTAL : 0.691771 sec TOTAL (123) : 0.655068 sec TOTAL (23) : 0.627450 sec TOTAL (1) : 0.027618 sec TOTAL (2) : 0.097003 sec TOTAL (3) : 0.530446 sec ***********************************************************************
…x lower than AVX2. objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l 0 objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm|xmm' | wc -l 2487 ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[1] == AOS Momenta memory layout = AOSOA[1] == AOS Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 1.111078e+00 ) sec TotalTime[Rambo+ME] (23)= ( 1.083266e+00 ) sec TotalTime[RndNumGen] (1)= ( 2.781247e-02 ) sec TotalTime[Rambo] (2)= ( 9.737131e-02 ) sec TotalTime[MatrixElems] (3)= ( 9.858946e-01 ) sec MeanTimeInMatrixElems = ( 9.858946e-01 ) sec [Min,Max]TimeInMatrixElems = [ 9.858946e-01 , 9.858946e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 4.718731e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 4.839883e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 5.317891e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372323e-02 +- 1.131684e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.194264e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000263 sec 0b MemAlloc : 0.026867 sec 0c GenCreat : 0.000882 sec 0d SGoodHel : 0.000096 sec 1a GenSeed : 0.000009 sec 1b GenRnGen : 0.027803 sec 2a RamboIni : 0.006124 sec 2b RamboFin : 0.091247 sec 3a SigmaKin : 
0.985895 sec 4a DumpLoop : 0.004363 sec 8a CompStat : 0.003518 sec 9a GenDestr : 0.000074 sec 9b DumpScrn : 0.000162 sec 9c DumpJson : 0.000008 sec TOTAL : 1.147311 sec TOTAL (123) : 1.111078 sec TOTAL (23) : 1.083266 sec TOTAL (1) : 0.027812 sec TOTAL (2) : 0.097371 sec TOTAL (3) : 0.985895 sec ***********************************************************************
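To put the "no sign of speedup in AVX512 vs AVX2" remark into numbers, the matrix-element throughputs printed in the logs on this page can be divided by the AOSOA[1] value just above. The Python sketch below does only that arithmetic; the dictionary keys are the memory layout sizes as printed, and the choice of the 5.317891e+05 run as the baseline is mine.

```python
# EvtsPerSec[MatrixElems] values copied from the logs on this page,
# keyed by the N in "AOSOA[N]".
throughput = {
    1: 5.317891e+05,  # AOSOA[1] == AOS run just above
    2: 1.973316e+06,  # AOSOA[2] run (xmm only)
    4: 2.080523e+06,  # AOSOA[4] run (ymm)
    8: 2.006091e+06,  # AOSOA[8] run (zmm)
}

baseline = throughput[1]
speedup = {n: value / baseline for n, value in throughput.items()}

for n in sorted(speedup):
    print(f"AOSOA[{n}]: {speedup[n]:.2f}x over AOSOA[1]")
```

On these numbers AOSOA[4] is the fastest, and going from 4 to 8 events per page gives a ratio slightly below one rather than the factor of two the wider registers might suggest.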
(I guess the 1.8 vs 2.0 fluctuation is the usual VM hypervisor issue?...) ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 4.073515e-01 ) sec TotalTime[Rambo+ME] (23)= ( 3.797491e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.760239e-02 ) sec TotalTime[Rambo] (2)= ( 9.891759e-02 ) sec TotalTime[MatrixElems] (3)= ( 2.808315e-01 ) sec MeanTimeInMatrixElems = ( 2.808315e-01 ) sec [Min,Max]TimeInMatrixElems = [ 2.808315e-01 , 2.808315e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.287065e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.380617e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 1.866913e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000276 sec 0b MemAlloc : 0.026869 sec 0c GenCreat : 0.000815 sec 0d SGoodHel : 0.000049 sec 1a GenSeed : 0.000009 sec 1b GenRnGen : 0.027594 sec 2a RamboIni : 0.006764 sec 2b RamboFin : 0.092153 sec 3a SigmaKin : 0.280831 sec 4a DumpLoop : 0.004415 sec 8a CompStat : 0.003583 sec 9a GenDestr : 0.000073 sec 9b DumpScrn : 0.000180 
sec 9c DumpJson : 0.000007 sec TOTAL : 0.443619 sec TOTAL (123) : 0.407351 sec TOTAL (23) : 0.379749 sec TOTAL (1) : 0.027602 sec TOTAL (2) : 0.098918 sec TOTAL (3) : 0.280831 sec ***********************************************************************
[Note: I get 1.88 while 567308f was 2.08: real effect or VM fluctuation?] ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 4.073515e-01 ) sec TotalTime[Rambo+ME] (23)= ( 3.797491e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.760239e-02 ) sec TotalTime[Rambo] (2)= ( 9.891759e-02 ) sec TotalTime[MatrixElems] (3)= ( 2.808315e-01 ) sec MeanTimeInMatrixElems = ( 2.808315e-01 ) sec [Min,Max]TimeInMatrixElems = [ 2.808315e-01 , 2.808315e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.287065e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.380617e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 1.866913e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000276 sec 0b MemAlloc : 0.026869 sec 0c GenCreat : 0.000815 sec 0d SGoodHel : 0.000049 sec 1a GenSeed : 0.000009 sec 1b GenRnGen : 0.027594 sec 2a RamboIni : 0.006764 sec 2b RamboFin : 0.092153 sec 3a SigmaKin : 0.280831 sec 4a DumpLoop : 0.004415 sec 8a CompStat : 0.003583 sec 9a GenDestr : 0.000073 sec 9b DumpScrn : 0.000180 
sec 9c DumpJson : 0.000007 sec TOTAL : 0.443619 sec TOTAL (123) : 0.407351 sec TOTAL (23) : 0.379749 sec TOTAL (1) : 0.027602 sec TOTAL (2) : 0.098918 sec TOTAL (3) : 0.280831 sec ***********************************************************************
…8 instead of 1.88 ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 3.696970e-01 ) sec TotalTime[Rambo+ME] (23)= ( 3.420472e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.764989e-02 ) sec TotalTime[Rambo] (2)= ( 9.099844e-02 ) sec TotalTime[MatrixElems] (3)= ( 2.510487e-01 ) sec MeanTimeInMatrixElems = ( 2.510487e-01 ) sec [Min,Max]TimeInMatrixElems = [ 2.510487e-01 , 2.510487e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.418156e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.532794e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 2.088391e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.002384 sec 0b MemAlloc : 0.027189 sec 0c GenCreat : 0.000924 sec 0d SGoodHel : 0.000039 sec 1a GenSeed : 0.000009 sec 1b GenRnGen : 0.027641 sec 2a RamboIni : 0.006722 sec 2b RamboFin : 0.084276 sec 3a SigmaKin : 0.251049 sec 4a DumpLoop : 0.004665 sec 8a CompStat : 0.003574 sec 9a GenDestr : 0.000079 sec 9b DumpScrn : 0.000170 sec 9c DumpJson : 0.000008 sec TOTAL : 0.408729 sec 
TOTAL (123) : 0.369697 sec TOTAL (23) : 0.342047 sec TOTAL (1) : 0.027650 sec TOTAL (2) : 0.090998 sec TOTAL (3) : 0.251049 sec ***********************************************************************
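On the "real effect or VM fluctuation?" question: with NumIterations = 1 each log is a single measurement, so the only handle available here is to compare the AOSOA[4] throughputs that appear on this page. The Python sketch below computes the relative drop and the spread for three of them. These runs come from different commits, so this is only a size-of-effect estimate, not a proper repeatability test; the helper name is hypothetical.

```python
from statistics import mean

# EvtsPerSec[MatrixElems] copied from three AOSOA[4] logs on this page.
runs = [2.080523e+06, 1.866913e+06, 2.088391e+06]


def relative_drop(reference, value):
    """Fractional decrease of value with respect to reference."""
    return (reference - value) / reference


average = mean(runs)
spread = (max(runs) - min(runs)) / average
drop = relative_drop(runs[0], runs[1])

print(f"mean={average:.6e} spread={spread:.1%} drop={drop:.1%}")
```

A drop of roughly 10% that disappears again in the next log is compatible with run-to-run noise, but several iterations per configuration would be needed to say so with any confidence.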
./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[1] == AOS Momenta memory layout = AOSOA[1] == AOS Internal loops fptype_sv = SCALAR (no SIMD) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 1.098961e+00 ) sec TotalTime[Rambo+ME] (23)= ( 1.070591e+00 ) sec TotalTime[RndNumGen] (1)= ( 2.837009e-02 ) sec TotalTime[Rambo] (2)= ( 9.763392e-02 ) sec TotalTime[MatrixElems] (3)= ( 9.729566e-01 ) sec MeanTimeInMatrixElems = ( 9.729566e-01 ) sec [Min,Max]TimeInMatrixElems = [ 9.729566e-01 , 9.729566e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 4.770762e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 4.897185e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 5.388606e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372323e-02 +- 1.131684e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.194264e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) ***********************************************************************
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[8]
Momenta memory layout = AOSOA[8]
Internal loops fptype_sv = VECTOR[8] (AVX512F)
Random number generation = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.827660e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 3.550714e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.769458e-02 ) sec
TotalTime[Rambo] (2)= ( 9.366414e-02 ) sec
TotalTime[MatrixElems] (3)= ( 2.614073e-01 ) sec
MeanTimeInMatrixElems = ( 2.614073e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 2.614073e-01 , 2.614073e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.369735e+06 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 1.476571e+06 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.005637e+06 ) sec^-1
***********************************************************************
./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Internal loops fptype_sv = VECTOR[8] (AVX512F) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 3.831118e-01 ) sec TotalTime[Rambo+ME] (23)= ( 3.554309e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.768089e-02 ) sec TotalTime[Rambo] (2)= ( 9.398625e-02 ) sec TotalTime[MatrixElems] (3)= ( 2.614447e-01 ) sec MeanTimeInMatrixElems = ( 2.614447e-01 ) sec [Min,Max]TimeInMatrixElems = [ 2.614447e-01 , 2.614447e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.368499e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.475077e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 2.005350e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372469e-02 +- 1.132952e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374903e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.203450e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) ***********************************************************************
…or-width=512)
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[8]
Momenta memory layout = AOSOA[8]
Internal loops fptype_sv = VECTOR[8] (AVX512F)
Random number generation = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.753826e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 3.477140e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.766860e-02 ) sec
TotalTime[Rambo] (2)= ( 9.360017e-02 ) sec
TotalTime[MatrixElems] (3)= ( 2.541139e-01 ) sec
MeanTimeInMatrixElems = ( 2.541139e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 2.541139e-01 , 2.541139e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.396676e+06 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 1.507814e+06 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.063201e+06 ) sec^-1
***********************************************************************
…an AVX512F)
./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid = 16384
NumThreadsPerBlock = 32
NumIterations = 1
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Internal loops fptype_sv = VECTOR[4] (AVX2)
Random number generation = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries = 1
TotalTime[Rnd+Rmb+ME] (123)= ( 3.731868e-01 ) sec
TotalTime[Rambo+ME] (23)= ( 3.455043e-01 ) sec
TotalTime[RndNumGen] (1)= ( 2.768257e-02 ) sec
TotalTime[Rambo] (2)= ( 9.354625e-02 ) sec
TotalTime[MatrixElems] (3)= ( 2.519580e-01 ) sec
MeanTimeInMatrixElems = ( 2.519580e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 2.519580e-01 , 2.519580e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 524288
EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.404894e+06 ) sec^-1
EvtsPerSec[Rmb+ME] (23)= ( 1.517457e+06 ) sec^-1
EvtsPerSec[MatrixElems] (3)= ( 2.080855e+06 ) sec^-1
***********************************************************************
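The vector sizes in the "Internal loops fptype_sv" lines follow directly from register width divided by element width: VECTOR[4] for doubles in 256-bit ymm registers, VECTOR[8] for doubles in 512-bit zmm registers. A small Python sketch of that arithmetic, keyed by the register names used in the objdump checks on this page (the function and table names are mine):

```python
# Register widths in bits for the x86 SIMD register files.
REGISTER_BITS = {"xmm": 128, "ymm": 256, "zmm": 512}

# Element widths in bits for the two FP precisions shown in the logs.
ELEMENT_BITS = {"DOUBLE": 64, "FLOAT": 32}


def simd_lanes(register, precision):
    """Number of floating-point values that fit in one SIMD register."""
    return REGISTER_BITS[register] // ELEMENT_BITS[precision]


for register in ("xmm", "ymm", "zmm"):
    print(register, simd_lanes(register, "DOUBLE"), simd_lanes(register, "FLOAT"))
```

The same arithmetic says that a single-precision build doubles the lane count for a given register width, for example 8 floats per ymm register.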
…ster?...) ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = FLOAT (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[1] == AOS Momenta memory layout = AOSOA[1] == AOS Internal loops fptype_sv = SCALAR (no SIMD) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 2.974846e+00 ) sec TotalTime[Rambo+ME] (23)= ( 2.946718e+00 ) sec TotalTime[RndNumGen] (1)= ( 2.812788e-02 ) sec TotalTime[Rambo] (2)= ( 1.032891e-01 ) sec TotalTime[MatrixElems] (3)= ( 2.843429e+00 ) sec MeanTimeInMatrixElems = ( 2.843429e+00 ) sec [Min,Max]TimeInMatrixElems = [ 2.843429e+00 , 2.843429e+00 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.762404e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.779227e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 1.843858e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372328e-02 +- 1.131740e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 5.445383e-03 , 7.884562e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.194670e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) ***********************************************************************
…t multiplied by float vector) Results ok in no-simd float. Throughput 1.84E5 (lower than no-simd double?!) ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = FLOAT (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[1] == AOS Momenta memory layout = AOSOA[1] == AOS Internal loops fptype_sv = SCALAR (no SIMD) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 2.973374e+00 ) sec TotalTime[Rambo+ME] (23)= ( 2.945622e+00 ) sec TotalTime[RndNumGen] (1)= ( 2.775213e-02 ) sec TotalTime[Rambo] (2)= ( 1.001029e-01 ) sec TotalTime[MatrixElems] (3)= ( 2.845519e+00 ) sec MeanTimeInMatrixElems = ( 2.845519e+00 ) sec [Min,Max]TimeInMatrixElems = [ 2.845519e+00 , 2.845519e+00 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.763276e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.779889e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 1.842504e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372328e-02 +- 1.131740e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 5.445383e-03 , 7.884562e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.194670e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) ***********************************************************************
./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = FLOAT (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Internal loops fptype_sv = VECTOR[8] (AVX2) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 2.929596e-01 ) sec TotalTime[Rambo+ME] (23)= ( 2.650706e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.788902e-02 ) sec TotalTime[Rambo] (2)= ( 9.624017e-02 ) sec TotalTime[MatrixElems] (3)= ( 1.688304e-01 ) sec MeanTimeInMatrixElems = ( 1.688304e-01 ) sec [Min,Max]TimeInMatrixElems = [ 1.688304e-01 , 1.688304e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.789626e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.977919e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 3.105412e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372471e-02 +- 1.132959e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 5.997265e-03 , 3.883831e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.203501e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000295 sec 0b MemAlloc : 0.015355 sec 0c GenCreat : 0.000865 sec 0d SGoodHel : 0.000040 sec 1a GenSeed : 0.000009 sec 1b GenRnGen : 0.027880 sec 2a RamboIni : 0.005198 sec 2b RamboFin : 0.091043 sec 3a SigmaKin : 0.168830 sec 4a DumpLoop : 0.003573 sec 8a CompStat : 0.003527 sec 9a GenDestr : 0.000074 sec 9b DumpScrn : 0.000151 sec 9c DumpJson : 0.000008 sec 
TOTAL : 0.316847 sec TOTAL (123) : 0.292960 sec TOTAL (23) : 0.265071 sec TOTAL (1) : 0.027889 sec TOTAL (2) : 0.096240 sec TOTAL (3) : 0.168830 sec ***********************************************************************
./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = FLOAT (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[16] Momenta memory layout = AOSOA[16] Internal loops fptype_sv = VECTOR[16] (AVX512F) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 1.230690e-01 ) sec TotalTime[Rambo+ME] (23)= ( 9.521992e-02 ) sec TotalTime[RndNumGen] (1)= ( 2.784907e-02 ) sec TotalTime[Rambo] (2)= ( 9.479249e-02 ) sec TotalTime[MatrixElems] (3)= ( 4.274280e-04 ) sec MeanTimeInMatrixElems = ( 4.274280e-04 ) sec [Min,Max]TimeInMatrixElems = [ 4.274280e-04 , 4.274280e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 4.260115e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 5.506075e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 1.226611e+09 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 0.000000e+00 +- 0.000000e+00 ) GeV^0 [Min,Max]MatrixElemValue = [ 0.000000e+00 , 0.000000e+00 ] GeV^0 StdDevMatrixElemValue = ( 0.000000e+00 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) ***********************************************************************
Result ok in float AVX512. Not faster than float AVX2. ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = FLOAT (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[16] Momenta memory layout = AOSOA[16] Internal loops fptype_sv = VECTOR[16] (AVX512F) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 2.983189e-01 ) sec TotalTime[Rambo+ME] (23)= ( 2.705257e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.779320e-02 ) sec TotalTime[Rambo] (2)= ( 9.526685e-02 ) sec TotalTime[MatrixElems] (3)= ( 1.752588e-01 ) sec MeanTimeInMatrixElems = ( 1.752588e-01 ) sec [Min,Max]TimeInMatrixElems = [ 1.752588e-01 , 1.752588e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.757475e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.938034e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 2.991507e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372028e-02 +- 1.132663e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.032828e-03 , 3.488652e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.201354e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) ***********************************************************************
./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Internal loops fptype_sv = VECTOR[4] (AVX2) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 3.702728e-01 ) sec TotalTime[Rambo+ME] (23)= ( 3.425338e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.773900e-02 ) sec TotalTime[Rambo] (2)= ( 9.132473e-02 ) sec TotalTime[MatrixElems] (3)= ( 2.512090e-01 ) sec MeanTimeInMatrixElems = ( 2.512090e-01 ) sec [Min,Max]TimeInMatrixElems = [ 2.512090e-01 , 2.512090e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.415951e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.530617e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 2.087059e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) ***********************************************************************
./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Internal loops fptype_sv = VECTOR[8] (AVX512F) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 3.759226e-01 ) sec TotalTime[Rambo+ME] (23)= ( 3.481836e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.773900e-02 ) sec TotalTime[Rambo] (2)= ( 9.471789e-02 ) sec TotalTime[MatrixElems] (3)= ( 2.534657e-01 ) sec MeanTimeInMatrixElems = ( 2.534657e-01 ) sec [Min,Max]TimeInMatrixElems = [ 2.534657e-01 , 2.534657e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.394670e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.505780e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 2.068477e+06 ) sec^-1 ***********************************************************************
…e than AVX2" This reverts commit c65008e.
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l 1239 ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Internal loops fptype_sv = VECTOR[4] (AVX2) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 3.714090e-01 ) sec TotalTime[Rambo+ME] (23)= ( 3.437035e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.770548e-02 ) sec TotalTime[Rambo] (2)= ( 9.189213e-02 ) sec TotalTime[MatrixElems] (3)= ( 2.518114e-01 ) sec MeanTimeInMatrixElems = ( 2.518114e-01 ) sec [Min,Max]TimeInMatrixElems = [ 2.518114e-01 , 2.518114e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.411619e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.525408e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 2.082067e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** ./gcheck.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = THRUST::COMPLEX 
RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Wavefunction GPU memory = LOCAL Random number generation = CURAND DEVICE (CUDA code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 7.326907e-03 ) sec TotalTime[Rambo+ME] (23)= ( 6.668723e-03 ) sec TotalTime[RndNumGen] (1)= ( 6.581840e-04 ) sec TotalTime[Rambo] (2)= ( 5.870857e-03 ) sec TotalTime[MatrixElems] (3)= ( 7.978660e-04 ) sec MeanTimeInMatrixElems = ( 7.978660e-04 ) sec [Min,Max]TimeInMatrixElems = [ 7.978660e-04 , 7.978660e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.155652e+07 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 7.861895e+07 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 6.571128e+08 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) ***********************************************************************
objdump -d -C check.exe | egrep 'vaddpd|vmul|vfmadd132pd|ymm' | wc -l 1246 ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[4] Momenta memory layout = AOSOA[4] Internal loops fptype_sv = VECTOR[4] (AVX2) Random number generation = CURAND (C++ code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123)= ( 3.720043e-01 ) sec TotalTime[Rambo+ME] (23)= ( 3.441720e-01 ) sec TotalTime[RndNumGen] (1)= ( 2.783231e-02 ) sec TotalTime[Rambo] (2)= ( 9.215697e-02 ) sec TotalTime[MatrixElems] (3)= ( 2.520150e-01 ) sec MeanTimeInMatrixElems = ( 2.520150e-01 ) sec [Min,Max]TimeInMatrixElems = [ 2.520150e-01 , 2.520150e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123)= ( 1.409360e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23)= ( 1.523332e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3)= ( 2.080384e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.371958e-02 +- 1.132119e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374915e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.197419e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) ***********************************************************************
…ype_v:
g++ -O3 -std=c++11 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -march=core-avx2 -I/usr/local/cuda-11.1/include/ -c check.cc -o check.o
check.cc: In function ‘std::unique_ptr<T []> hstMakeUnique(std::size_t) [with T = __vector(4) double; std::size_t = long unsigned int]’:
check.cc:82:118: warning: ‘new’ of type ‘mgOnGpu::fptype_v’ {aka ‘__vector(4) double’} with extended alignment 32 [-Waligned-new=]
_ptr<fptype_v[]> hstMakeUnique(std::size_t N) { return std::unique_ptr<fptype_v[]>{ new fptype_v[N/neppV]() }; };
^
check.cc:82:118: note: uses ‘void* operator new [](std::size_t)’, which does not have an alignment parameter
check.cc:82:118: note: use ‘-faligned-new’ to enable C++17 over-aligned new support
…do it later Now revert this, complete some xxx things first, and later do this move
…ut will do it later" This reverts commit fb43a4b.
Fix all conflicts. All tests pass in epoch1 (avx512 and none) and epoch2.
Fix conflicts in epoch1 testxxx.cc and CPPProcess.cc
Fix conflicts epoch1 CPPProcess.cc ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP Internal loops fptype_sv = VECTOR[1] == SCALAR (no SIMD) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 1.312772e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 7.154965 sec real 0m7.164s ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CUDA EvtsPerSec[MatrixElems] (3) = ( 7.292815e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.955000 sec real 0m1.246s ==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120 ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP Internal loops fptype_sv = VECTOR[4] (AVX512F) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 4.715700e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 3.620541 sec real 0m3.630s ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CUDA EvtsPerSec[MatrixElems] (3) = ( 7.299221e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 1.168761 sec real 0m1.465s ==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120 ------------------------------------------------------------------------- Process = EPOCH2_EEMUMU_CPP OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 1.131055e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 7.925395 sec real 0m7.935s ------------------------------------------------------------------------- Process = EPOCH2_EEMUMU_CUDA EvtsPerSec[MatrixElems] (3) = ( 7.399876e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.915813 sec real 0m1.208s ==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 
164 -------------------------------------------------------------------------
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/CPPProcess.cc ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP Internal loops fptype_sv = VECTOR[1] == SCALAR (no SIMD) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 1.311938e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 7.160762 sec real 0m7.170s ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CUDA EvtsPerSec[MatrixElems] (3) = ( 7.165268e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.740539 sec real 0m1.032s ==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120 ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP Internal loops fptype_sv = VECTOR[4] (AVX512F) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 4.716124e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 3.622706 sec real 0m3.633s ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CUDA EvtsPerSec[MatrixElems] (3) = ( 7.270777e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.774117 sec real 0m1.067s ==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120 ------------------------------------------------------------------------- Process = EPOCH2_EEMUMU_CPP OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 1.138079e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 7.891635 sec real 0m7.901s ------------------------------------------------------------------------- Process = EPOCH2_EEMUMU_CUDA EvtsPerSec[MatrixElems] (3) = ( 7.277430e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.966146 sec real 0m1.258s ==PROF== Profiling 
"_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164 -------------------------------------------------------------------------
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/CPPProcess.cc Baseline performance after the last(?) merge of testxxx. CUDA in epoch1 is always around 1-2% slower than in epoch2, due to the SIMD changes in C++. ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP Internal loops fptype_sv = VECTOR[1] == SCALAR (no SIMD) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 1.317036e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 7.141695 sec real 0m7.152s ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CUDA EvtsPerSec[MatrixElems] (3) = ( 7.308881e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.727713 sec real 0m1.018s ==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120 ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP Internal loops fptype_sv = VECTOR[4] (AVX512F) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 4.883989e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 3.597827 sec real 0m3.607s ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CUDA EvtsPerSec[MatrixElems] (3) = ( 7.252899e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.731646 sec real 0m1.022s ==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120 ------------------------------------------------------------------------- Process = EPOCH2_EEMUMU_CPP OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 1.147018e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 7.843119 sec real 0m7.852s ------------------------------------------------------------------------- Process = EPOCH2_EEMUMU_CUDA EvtsPerSec[MatrixElems] (3) 
= ( 7.399451e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.730010 sec real 0m1.023s ==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164 -------------------------------------------------------------------------
…osmetics (fix a comment)
…g bug fix?) If MGONGPU_CPPSIMD is not defined, then __AVX2__ and similar macros must be ignored!
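The rule stated above can be pictured with a small preprocessor sketch. It assumes that `MGONGPU_CPPSIMD` acts as an opt-in switch set by the build, and it uses a hypothetical `sketch` namespace; the real header in this PR may be organised differently. The vector widths are the double-precision ones reported in the logs (2, 4, 8).

```cpp
#include <cstddef>

namespace sketch // hypothetical namespace, for illustration only
{
  // The compiler defines __AVX512F__, __AVX2__ and __SSE4_2__ from -march/-m flags,
  // whether or not the application wants SIMD. Every test is therefore gated
  // on MGONGPU_CPPSIMD, so that a build without it always falls back to scalar.
#if defined MGONGPU_CPPSIMD && defined __AVX512F__
  constexpr std::size_t neppV = 8; // doubles per 512-bit vector
#elif defined MGONGPU_CPPSIMD && defined __AVX2__
  constexpr std::size_t neppV = 4; // doubles per 256-bit vector
#elif defined MGONGPU_CPPSIMD && defined __SSE4_2__
  constexpr std::size_t neppV = 2; // doubles per 128-bit vector
#else
  constexpr std::size_t neppV = 1; // scalar, no SIMD
#endif

  // True only when a vector width above one was selected.
  constexpr bool simdEnabled() { return neppV > 1; }
}
```

With this shape, compiling with -march=core-avx2 but without -DMGONGPU_CPPSIMD still yields `neppV == 1`, which is the behaviour the commit message asks for.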
…mdSymSummary.sh
for avx in none sse4 avx2 512y 512z; do f=./build.$avx/CPPProcess.o; ./simdSymSummary.sh $f; done
=== Symbols in ./build.none/CPPProcess.o === (~sse4: 611) (avx2: 0) (512y: 0) (512z: 0)
=== Symbols in ./build.sse4/CPPProcess.o === (~sse4: 3278) (avx2: 0) (512y: 0) (512z: 0)
=== Symbols in ./build.avx2/CPPProcess.o === (~sse4: 0) (avx2: 2766) (512y: 0) (512z: 0)
=== Symbols in ./build.512y/CPPProcess.o === (~sse4: 0) (avx2: 2597) (512y: 94) (512z: 0)
=== Symbols in ./build.512z/CPPProcess.o === (~sse4: 0) (avx2: 1182) (512y: 208) (512z: 2035)
for avx in none sse4 avx2 512y 512z; do f=./build.$avx/CPPProcess.o; ./simdSymSummary.sh -helamps $f; done
=== Symbols in ./build.none/CPPProcess.o === (~sse4: 536) (avx2: 0) (512y: 0) (512z: 0)
=== Symbols in ./build.sse4/CPPProcess.o === (~sse4: 3145) (avx2: 0) (512y: 0) (512z: 0)
=== Symbols in ./build.avx2/CPPProcess.o === (~sse4: 0) (avx2: 2500) (512y: 0) (512z: 0)
=== Symbols in ./build.512y/CPPProcess.o === (~sse4: 0) (avx2: 2369) (512y: 77) (512z: 0)
=== Symbols in ./build.512z/CPPProcess.o === (~sse4: 0) (avx2: 979) (512y: 189) (512z: 2015)
The idea is to distinguish between AVX512 on xmm/ymm registers and AVX512 on zmm registers. The former ("AVX256" in LHCb) is the fastest option. The latter is slower, probably because it causes a clock slowdown (to be demonstrated). Note in any case that '512z' seems to have many more instructions in total (why?)
=== Symbols in ./build.512y/CPPProcess.o === (~sse4: 0) (avx2: 2597) (512y: 94) (512z: 0)
=== Symbols in ./build.512z/CPPProcess.o === (~sse4: 0) (avx2: 1182) (512y: 208) (512z: 2035)
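The register-width side of this distinction can be illustrated with a small classifier for disassembly lines, in the spirit of the `objdump | egrep 'ymm|zmm'` counts used earlier in this PR. This is a rough sketch, not a reimplementation of simdSymSummary.sh: `RegWidth`, `contains` and `widestRegister` are hypothetical names, and the function looks only at register names. It therefore cannot tell an AVX2 instruction on ymm from an AVX512 instruction on ymm, which is presumably what the '512y' column counts.

```cpp
#include <cstddef>

// Widest SIMD register class named in one line of disassembly.
enum class RegWidth { None, Xmm128, Ymm256, Zmm512 };

// Naive substring search on NUL-terminated strings (constexpr, needs C++14 or later).
constexpr bool contains(const char* text, const char* pat)
{
  for (std::size_t i = 0; text[i] != '\0'; ++i)
  {
    std::size_t j = 0;
    while (pat[j] != '\0' && text[i + j] != '\0' && text[i + j] == pat[j]) ++j;
    if (pat[j] == '\0') return true; // whole pattern matched at offset i
  }
  return pat[0] == '\0'; // empty pattern matches anything
}

// Classify by register name only: zmm wins over ymm, ymm over xmm.
constexpr RegWidth widestRegister(const char* line)
{
  return contains(line, "zmm") ? RegWidth::Zmm512
       : contains(line, "ymm") ? RegWidth::Ymm256
       : contains(line, "xmm") ? RegWidth::Xmm128
       : RegWidth::None;
}
```

Feeding each line of `objdump -d` output through `widestRegister` and tallying the results would give per-width counts comparable to the `wc -l` figures, under the limitation stated above.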
Current baseline performance on all avx modes
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv = VECTOR[1] ('none': scalar, no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.313578e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.158000 sec
real 0m7.168s
=== Symbols in CPPProcess.o === (~sse4: 611) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.467718e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.769639 sec
real 0m1.073s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.479539e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.896227 sec
real 0m4.906s
=== Symbols in CPPProcess.o === (~sse4: 3278) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.490480e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.695938 sec
real 0m3.706s
=== Symbols in CPPProcess.o === (~sse4: 0) (avx2: 2766) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.912842e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.571775 sec
real 0m3.581s
=== Symbols in CPPProcess.o === (~sse4: 0) (avx2: 2597) (512y: 94) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.702956e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.014852 sec
real 0m4.024s
=== Symbols in CPPProcess.o === (~sse4: 0) (avx2: 1182) (512y: 208) (512z: 2035)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147804e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.834120 sec
real 0m7.843s
=== Symbols in CPPProcess.o === (~sse4: 561) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
EvtsPerSec[MatrixElems] (3) = ( 6.520156e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.763560 sec
real 0m1.068s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
Fix conflicts: epoch1 CPPProcess.cc and throughput12.sh Baseline performance after the merge: ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP FP precision = DOUBLE (NaN/abnormal=0, zero=0 ) Internal loops fptype_sv = VECTOR[1] ('none': scalar, no SIMD) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 1.307178e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 7.202903 sec real 0m7.213s === Symbols in CPPProcess.o === (~sse4: 617) (avx2: 0) (512y: 0) (512z: 0) ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CUDA FP precision = DOUBLE (NaN/abnormal=0, zero=0 ) EvtsPerSec[MatrixElems] (3) = ( 7.196734e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.748795 sec real 0m1.051s ==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120 ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP FP precision = DOUBLE (NaN/abnormal=0, zero=0 ) Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128 vector width) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 2.515310e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 4.851119 sec real 0m4.861s === Symbols in CPPProcess.o === (~sse4: 3264) (avx2: 0) (512y: 0) (512z: 0) ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP FP precision = DOUBLE (NaN/abnormal=0, zero=0 ) Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256 vector width) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 4.451217e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 3.709405 sec real 0m3.719s === Symbols in CPPProcess.o === (~sse4: 0) (avx2: 2770) (512y: 0) (512z: 0) ------------------------------------------------------------------------- Process = 
EPOCH1_EEMUMU_CPP FP precision = DOUBLE (NaN/abnormal=0, zero=0 ) Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256 vector width) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 4.842292e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 3.604947 sec real 0m3.615s === Symbols in CPPProcess.o === (~sse4: 0) (avx2: 2600) (512y: 94) (512z: 0) ------------------------------------------------------------------------- Process = EPOCH1_EEMUMU_CPP FP precision = DOUBLE (NaN/abnormal=0, zero=0 ) Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512 vector width) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 3.727711e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 4.005770 sec real 0m4.016s === Symbols in CPPProcess.o === (~sse4: 0) (avx2: 1184) (512y: 208) (512z: 2040) ------------------------------------------------------------------------- Process = EPOCH2_EEMUMU_CPP FP precision = DOUBLE (NaN/abnormal=0, zero=0 ) OMP threads / `nproc --all` = 1 / 4 EvtsPerSec[MatrixElems] (3) = ( 1.146197e+06 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 7.847846 sec real 0m7.860s === Symbols in CPPProcess.o === (~sse4: 567) (avx2: 0) (512y: 0) (512z: 0) ------------------------------------------------------------------------- Process = EPOCH2_EEMUMU_CUDA FP precision = DOUBLE (NaN/abnormal=0, zero=0 ) EvtsPerSec[MatrixElems] (3) = ( 7.293983e+08 ) sec^-1 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 TOTAL : 0.747671 sec real 0m1.050s ==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164 -------------------------------------------------------------------------
Fix various conflicts.
Baseline double performance
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv = VECTOR[1] ('none': scalar, no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.305953e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.184897 sec
real 0m7.195s
=== Symbols in CPPProcess.o ===
(~sse4: 617) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 6.997065e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.906774 sec
real 0m1.209s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.512330e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.854177 sec
real 0m4.864s
=== Symbols in CPPProcess.o ===
(~sse4: 3264) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.446023e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.712585 sec
real 0m3.722s
=== Symbols in CPPProcess.o ===
(~sse4: 0) (avx2: 2770) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.799523e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.602933 sec
real 0m3.613s
=== Symbols in CPPProcess.o ===
(~sse4: 0) (avx2: 2600) (512y: 94) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512 vector width)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.718682e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.009003 sec
real 0m4.018s
=== Symbols in CPPProcess.o ===
(~sse4: 0) (avx2: 1184) (512y: 208) (512z: 2040)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147329e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.841047 sec
real 0m7.851s
=== Symbols in CPPProcess.o ===
(~sse4: 567) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.151338e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.878278 sec
real 0m1.170s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
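To make the "Baseline double performance" numbers easier to compare, here is a minimal sketch that turns the EPOCH1_EEMUMU_CPP throughputs into speedups over the scalar ('none') build. The values are copied from the EvtsPerSec[MatrixElems] lines of that log; the dictionary name and the `speedups` helper are hypothetical and not part of the repository.

```python
# EvtsPerSec[MatrixElems] for EPOCH1_EEMUMU_CPP, copied by hand from the
# "Baseline double performance" log. Keys are the SIMD build tags shown there.
EVTS_PER_SEC = {
    "none": 1.305953e+06,  # VECTOR[1], scalar
    "sse4": 2.512330e+06,  # VECTOR[2], 128-bit
    "avx2": 4.446023e+06,  # VECTOR[4], 256-bit
    "512y": 4.799523e+06,  # VECTOR[4], AVX512 with 256-bit vectors
    "512z": 3.718682e+06,  # VECTOR[8], AVX512 with 512-bit vectors
}


def speedups(throughput, baseline="none"):
    """Return each build's throughput divided by the baseline build's."""
    base = throughput[baseline]
    return {name: value / base for name, value in throughput.items()}


if __name__ == "__main__":
    for name, ratio in speedups(EVTS_PER_SEC).items():
        print(f"{name:>5}: {ratio:.2f}x")
```

On these numbers the '512y' build has the highest throughput, while the 512-bit '512z' build is slower than both 'avx2' and '512y'.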
This PR merges klas2 #132 (replacing klas #72) and epoch12 #151.
It will replace #72 and #132. Opening this as WIP.