
Recompute GPU branch efficiency with a non-dummy random helicity/color selection #609

Open
valassi opened this issue Mar 28, 2023 · 0 comments


@valassi
Member

valassi commented Mar 28, 2023

Recompute GPU branch efficiency with a non-dummy random helicity/color selection. This is a sub-issue of #608 (itself related to #607), opened following a chat with @zeniheisser.

The random choices of color and helicity intrinsically introduce some stochastic branching, which is expected to degrade data-parallel performance. On GPUs, where threads in the same warp may then take different branches, this should show up as a branch efficiency lower than 100%.

Note that all tests in the tput directory currently report 100% branch efficiency in the gcheck.exe profiles. For example:

```
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
```

```
On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg/build.none_d_inl0_hrd0/gcheck.exe -p 64 256 1 OMP=
Process                     = SIGMA_SM_GG_TTXGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[Rmb+ME]     (23) = ( 3.484965e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 3.511950e+05                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.514727e+05                 )  sec^-1
MeanMatrixElemValue         = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     0.543039 sec
     2,065,090,824      cycles                    #    2.646 GHz
     3,221,928,837      instructions              #    1.56  insn per cycle
       0.839652102 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
.........................................................................
```

This might be because the value is rounded to an integer when printed (unlikely). More likely, it is because check.cc always uses 0 as the input random number, so all GPU threads go through exactly the same "stochastic" branches (there is no randomness: all inputs are 0) and remain effectively in lockstep.

Two things could be done here:

  • First, some randomness could be introduced (on demand) in gcheck.exe when running the profiles: it would be enough to populate the allrndhel and allrndcol arrays with real random numbers rather than with 0. One could then do one run with all zeros and one run with random numbers, and compare both the performance and the branch efficiency profiles.
  • Second, and perhaps more interesting, gmadevent_cudacpp itself could be profiled.