WIP: gg to ttgggg (2->6 process) #601
base: master
Conversation
Later on, however (now), clang has also increased in size, and cuda too.
Update: the cuda build is still running after more than one day.
The clang++ build with inlining has been killed by the OOM killer.
The clang++ build without inlining was still running but seemed stuck: high memory use but 0% CPU. I saw that the AFS token had expired in the meantime, so I stopped the build with ctrl-z, renewed the token and resumed it with fg, but this caused a crash immediately afterwards.
I will restart this one...
The clang++ build without inlining finally completed! It took 32 hours to compile CPPProcess.o on lxplus.
I will relaunch the build with inlining. Note, however, that the cuda build is still ongoing...
The clang build with inlining never completed successfully (on lxplus, my interactive process was logged out each time within one or two days, which I suspect is a symptom of running out of memory). As for cuda, the build is still running after one week! I will kill the process; it is unreasonable to keep it going any longer.
…ocess.cc which is 32MB)
Note: the generation of gg_ttgggg.mad failed, killed by the out-of-memory (OOM) killer after ~1h30:
dmesg -T | egrep -i 'killed process'
[Fri Feb 24 21:45:56 2023] Out of memory: Killed process 2812622 (python3) total-vm:30208192kB, anon-rss:14254780kB, file-rss:4kB, shmem-rss:0kB, UID:14546 pgtables:58908kB oom_score_adj:0
…d (-makej -inl)
[root@itscrd90 cudacpp]# grep -i 'killed process' /var/log/messages
Feb 24 21:45:56 itscrd90.cern.ch kernel: Out of memory: Killed process 2812622 (python3) total-vm:30208192kB, anon-rss:14254780kB, file-rss:4kB, shmem-rss:0kB, UID:14546 pgtables:58908kB oom_score_adj:0
Feb 25 12:08:32 itscrd90.cern.ch kernel: Out of memory: Killed process 25738 (dbus-broker-lau) total-vm:19644kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:60kB oom_score_adj:200
Feb 25 12:08:32 itscrd90.cern.ch kernel: Out of memory: Killed process 2859216 (cudafe++) total-vm:4439180kB, anon-rss:2533172kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:8728kB oom_score_adj:0
Feb 25 12:09:59 itscrd90.cern.ch kernel: Out of memory: Killed process 2859218 (cudafe++) total-vm:4830956kB, anon-rss:2404060kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:9504kB oom_score_adj:0
Feb 25 12:12:26 itscrd90.cern.ch kernel: Out of memory: Killed process 2859211 (cudafe++) total-vm:4830956kB, anon-rss:1651848kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:9496kB oom_score_adj:0
Feb 25 12:17:51 itscrd90.cern.ch kernel: Out of memory: Killed process 2859172 (cc1plus) total-vm:5225996kB, anon-rss:3906132kB, file-rss:0kB, shmem-rss:0kB, UID:14546 pgtables:9800kB oom_score_adj:0
The first line is the failed generation of gg_ttgggg.mad yesterday. The next lines are the failed builds.
NB: the builds failed already with inl0. I only have gg_ttgggg.sa/SubProcesses/P1_Sigma_sm_gg_ttxgggg/build.*hrd0 and none has a complete CPPProcess.o.
I will retry one by one as: ./tput/throughputX.sh -ggttgggg -sa -512yonly -makeclean
…FLAGS+= -freport-bug" to prepare bug reports for internal compiler errors
I have rebased over upstream/master... I will probably close this MR as unmerged, but at least it's updated now. And I will cherry-pick a few commits elsewhere.
…FVs and for compiling them as separate object files (related to splitting kernels)
…d MemoryAccessMomenta.h
…the P subdirectory (depends on npar) - build succeeds for cpp, link fails for cuda
ccache /usr/local/cuda-12.0/bin/nvcc -I. -I../../src -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -std=c++17 -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -Xcompiler -fPIC -c -x cu CPPProcess.cc -o CPPProcess_cuda.o
ptxas fatal : Unresolved extern function '_ZN9mg5amcGpu14helas_VVV1P0_1EPKdS1_S1_dddPd'
…cuda tests succeed
The build issues some warnings, however:
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
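For readers unfamiliar with nvcc separate compilation, here is a minimal sketch of what the unresolved extern above is about (the file names and function body are hypothetical, and the actual cudacpp makefile logic may differ): a __device__ function defined in its own translation unit can only be called from another .cu file if both are compiled as relocatable device code (e.g. nvcc -rdc=true or --device-c) and then device-linked; otherwise ptxas reports exactly this kind of 'Unresolved extern function' error.

```cuda
// Hypothetical two-file sketch, not the generated cudacpp code.

// --- HelAmps.cu: the single definition of the helas kernel ---
namespace mg5amcGpu
{
  __device__ void helas_VVV1P0_1( const double* V2, const double* V3, const double* COUP,
                                  double M1, double W1, double* V1 )
  {
    V1[0] = V2[0] + V3[0] + COUP[0] + M1 + W1; // placeholder arithmetic, not the real amplitude
  }
}

// --- CPPProcess.cu: only a declaration; the symbol is resolved at device-link time ---
namespace mg5amcGpu
{
  __device__ void helas_VVV1P0_1( const double*, const double*, const double*,
                                  double, double, double* );

  __global__ void sigmaKinSketch( const double* in, double* out )
  {
    helas_VVV1P0_1( in, in, in, 0., 0., out ); // without -rdc plus a device link step,
                                               // this is the 'Unresolved extern function' above
  }
}
```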
…ption HELINL=L and '#ifdef MGONGPU_LINKER_HELAMPS'
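To make the new option more concrete, here is a hedged sketch of what the HELINL=L switch amounts to (only MGONGPU_LINKER_HELAMPS and the helas function name come from this PR; the scaffolding and exact signatures below are illustrative assumptions): the generated header either merely declares each helas kernel, whose single definition is compiled in HelAmps.cc and resolved by the (device) linker, or defines it inline in the header as in the existing HELINL=0 mode.

```cpp
// Illustrative sketch only: the real generated HelAmps.h differs in names,
// signatures and in the host/device decoration macros used by cudacpp.
#ifndef __CUDACC__
#define __device__ // let the same sketch compile as plain C++
#endif
using fptype = double; // stand-in for the cudacpp fptype typedef

#ifdef MGONGPU_LINKER_HELAMPS
// HELINL=L: declaration only; the one definition lives in HelAmps.cc (one object
// file per P* directory) and is resolved at link time, keeping CPPProcess.cc small.
__device__ void helas_VVV1P0_1( const fptype* allV2, const fptype* allV3,
                                const fptype* allCOUP, const fptype M1,
                                const fptype W1, fptype* allV1 );
#else
// HELINL=0: the full definition is visible in every translation unit that includes
// the header, so the compiler may inline it into sigmaKin, at the cost of much
// longer compile times for large processes such as gg to ttgggg.
__device__ inline void helas_VVV1P0_1( const fptype* allV2, const fptype* allV3,
                                       const fptype* allCOUP, const fptype M1,
                                       const fptype W1, fptype* allV1 )
{
  /* ... VVV1P0_1 amplitude computation ... */
}
#endif
```

The trade-off, as the throughput logs below suggest, is faster and lighter builds against a possible runtime penalty when the compiler can no longer inline across the call.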
…nd -inlLonly options
… to ease code generation
…y in the HELINL=L mode
…c++, a factor 3 slower for cuda...
./tput/teeThroughputX.sh -ggtt -makej -makeclean -inlLonly
diff -u --color tput/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt tput/logs_ggtt_mad/log_ggtt_mad_d_inlL_hrd0.txt
-Process = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
 Workflow summary = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME] (23) = ( 4.589473e+07 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.164485e+08 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.280951e+08 ) sec^-1
-MeanMatrixElemValue = ( 2.086689e+00 +- 3.413217e-03 ) GeV^0
-TOTAL : 0.528239 sec
-INFO: No Floating Point Exceptions have been reported
- 2,222,057,027 cycles # 2.887 GHz
- 3,171,868,018 instructions # 1.43 insn per cycle
- 0.826440817 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inl0_hrd0/check_cuda.exe -p 2048 256 1
-==PROF== Profiling "sigmaKin": launch__registers_per_thread 214
+EvtsPerSec[Rmb+ME] (23) = ( 2.667135e+07 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.116115e+07 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.251573e+07 ) sec^-1
+MeanMatrixElemValue = ( 2.086689e+00 +- 3.413217e-03 ) GeV^0
+TOTAL : 0.550450 sec
+INFO: No Floating Point Exceptions have been reported
+ 2,272,219,097 cycles # 2.889 GHz
+ 3,361,475,195 instructions # 1.48 insn per cycle
+ 0.842685843 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inlL_hrd0/check_cuda.exe -p 2048 256 1
+==PROF== Profiling "sigmaKin": launch__registers_per_thread 190
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
…lates in HELINL=L mode
…t.mad of HelAmps.h in HELINL=L mode
…t.mad of CPPProcess.cc in HELINL=L mode
…P* (the source is the same but it must be compiled in each P* separately)
… complete its backport
git add *.mad/*/HelAmps.cc *.mad/*/*/HelAmps.cc *.sa/*/HelAmps.cc *.sa/*/*/HelAmps.cc
…ild failed?
./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlL
ccache /usr/local/cuda-12.0/bin/nvcc -I. -I../../src -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -std=c++17 -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_INLINE_HELAMPS -Xcompiler -fPIC -c -x cu CPPProcess.cc -o build.cuda_d_inl1_hrd0/CPPProcess_cuda.o
nvcc error : 'ptxas' died due to signal 9 (Kill signal)
make[2]: *** [cudacpp.mk:754: build.cuda_d_inl1_hrd0/CPPProcess_cuda.o] Error 9
make[2]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make[1]: *** [makefile:142: build.cuda_d_inl1_hrd0/.cudacpplibs] Error 2
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make: *** [makefile:282: bldcuda] Error 2
make: *** Waiting for unfinished jobs....
… build time is from cache
./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…mode (use that from the previous run, not from cache)
./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…factor x2 faster (c++? cuda?), runtime is 5-10% slower in C++, but 5-10% faster in cuda!?
./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlLonly
diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
...
On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME] (23) = ( 4.338149e+02 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338604e+02 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338867e+02 ) sec^-1
-MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
-TOTAL : 2.242693 sec
-INFO: No Floating Point Exceptions have been reported
- 7,348,976,543 cycles # 2.902 GHz
- 16,466,315,526 instructions # 2.24 insn per cycle
- 2.591057214 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME] (23) = ( 4.063038e+02 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.063437e+02 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.063626e+02 ) sec^-1
+MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
+TOTAL : 2.552546 sec
+INFO: No Floating Point Exceptions have been reported
+ 7,969,059,552 cycles # 2.893 GHz
+ 17,401,037,642 instructions # 2.18 insn per cycle
+ 2.954791685 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
 ==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
...
=========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
 FP precision = DOUBLE (NaN/abnormal=0, zero=0)
 Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME] (23) = ( 3.459662e+02 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.460086e+02 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.460086e+02 ) sec^-1
+EvtsPerSec[Rmb+ME] (23) = ( 3.835352e+02 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.836003e+02 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.836003e+02 ) sec^-1
 MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
-TOTAL : 1.528240 sec
+TOTAL : 1.378567 sec
 INFO: No Floating Point Exceptions have been reported
- 4,140,408,789 cycles # 2.703 GHz
- 9,072,597,595 instructions # 2.19 insn per cycle
- 1.532357792 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4: 0) (avx2:94048) (512y: 91) (512z: 0)
+ 3,738,350,469 cycles # 2.705 GHz
+ 8,514,195,736 instructions # 2.28 insn per cycle
+ 1.382567882 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4: 0) (avx2:80619) (512y: 89) (512z: 0)
-------------------------------------------------------------------------
…10-15% slower in both C++ and cuda
diff -u --color tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
-Executing ' ./build.512y_d_inlL_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.512y_d_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 128/128
 [XSECTION] VECSIZE_USED = 8192
@@ -401,10 +401,10 @@
 [XSECTION] ChannelId = 1
 [XSECTION] Cross section = 2.332e-07 [2.3322993086656014E-007] fbridge_mode=1
 [UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL : 320.6913s
- [COUNTERS] Fortran Overhead ( 0 ) : 4.5138s
- [COUNTERS] CudaCpp MEs ( 2 ) : 316.1312s for 90112 events => throughput is 2.85E+02 events/s
- [COUNTERS] CudaCpp HEL ( 3 ) : 0.0463s
+ [COUNTERS] PROGRAM TOTAL : 288.3304s
+ [COUNTERS] Fortran Overhead ( 0 ) : 4.4909s
+ [COUNTERS] CudaCpp MEs ( 2 ) : 283.7968s for 90112 events => throughput is 3.18E+02 events/s
+ [COUNTERS] CudaCpp HEL ( 3 ) : 0.0426s
-Executing ' ./build.cuda_d_inlL_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
+Executing ' ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_ggttggg_x10_cudacpp > /tmp/avalassi/output_ggttggg_x10_cudacpp'
 [OPENMPTH] omp_get_max_threads/nproc = 1/4
 [NGOODHEL] ngoodhel/ncomb = 128/128
 [XSECTION] VECSIZE_USED = 8192
@@ -557,10 +557,10 @@
 [XSECTION] ChannelId = 1
 [XSECTION] Cross section = 2.332e-07 [2.3322993086656006E-007] fbridge_mode=1
 [UNWEIGHT] Wrote 303 events (found 1531 events)
- [COUNTERS] PROGRAM TOTAL : 19.6663s
- [COUNTERS] Fortran Overhead ( 0 ) : 4.9649s
- [COUNTERS] CudaCpp MEs ( 2 ) : 13.4667s for 90112 events => throughput is 6.69E+03 events/s
- [COUNTERS] CudaCpp HEL ( 3 ) : 1.2347s
+ [COUNTERS] PROGRAM TOTAL : 18.0242s
+ [COUNTERS] Fortran Overhead ( 0 ) : 4.9891s
+ [COUNTERS] CudaCpp MEs ( 2 ) : 11.9530s for 90112 events => throughput is 7.54E+03 events/s
+ [COUNTERS] CudaCpp HEL ( 3 ) : 1.0821s
…arnings and runtime test failures in HELINL=0. There are still build failures in HELINL=L.
…allCOUP2 instead of allCOUP) to FFV2_4_0 and FFV2_4_3, fixing build failures in HELINL=L
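For context, an illustrative fragment of what the COUP1/COUP2 fix refers to (the signature and names below are assumptions for illustration, not the generated code): combined ALOHA routines such as FFV2_4_0 carry two couplings, one per lorentz structure, so the generated caller has to pass two distinct coupling arrays instead of a single allCOUP.

```cpp
// Hypothetical fragment: a combined FFV2_4 vertex routine takes two couplings.
#ifndef __CUDACC__
#define __device__ // let the sketch compile as plain C++
#endif
using fptype = double; // stand-in for the cudacpp fptype typedef

__device__ void FFV2_4_0( const fptype* allF1, const fptype* allF2, const fptype* allV3,
                          const fptype* allCOUP1, const fptype* allCOUP2, // two couplings
                          fptype* allvertexes );

// The commit above makes the generated caller pass allCOUP2 (not allCOUP) as the
// second coupling argument of FFV2_4_0 and FFV2_4_3.
```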
…d CI access, to fix the issues observed in ee_mumu. I did not find an easier way to do this, because the model is known in the aloha caller but not at the time of aloha codegen.
…one, COUP1/COUP2 instead of COUP; two, CI/CD instead of CD)
Fix conflicts: epochX/cudacpp/tput/teeThroughputX.sh epochX/cudacpp/tput/throughputX.sh
Fix conflicts: epochX/cudacpp/tput/teeThroughputX.sh epochX/cudacpp/tput/throughputX.sh
I regenerated gg_ttgggg with the helas codegen of PR #978. Using the HELINL=L option this still fails compilation on gcc. I guess it must be the color algebra that does not follow?
Also, clang fails with a different error (255).
Following the discussion at the last meeting, I started doing a few tests of gg to ttgggg. Here's a first WIP MR with some changes.
Note on codegen
NB: CPPProcess.cc is 32MB in size and contains 15495 Feynman diagrams and a 720x720 color matrix.
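As a back-of-the-envelope check (sketch only, using just the numbers quoted above), the color matrix alone already accounts for a sizeable chunk of that file:

```cpp
// Quick arithmetic on the figures quoted in this PR description.
#include <cstdio>

int main()
{
  constexpr long ncolor = 720;               // 720x720 color matrix
  constexpr long nentries = ncolor * ncolor; // 518400 entries written out in the source
  constexpr long ndiagrams = 15495;          // Feynman diagrams in CPPProcess.cc
  std::printf( "color matrix entries: %ld, Feynman diagrams: %ld\n", nentries, ndiagrams );
  return 0;
}
```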
Note on builds of ggttgggg.sa:
PS1: currently building with cuda on itscrd90
PS2: currently building with clang on lxplus9