Bypass color choice for channelId==0 (floating point exceptions in check.exe for ggttg, gqttq, nobm_pp_ttW) #783

valassi · 2023-10-30T07:48:06Z

This is a followup of #733 and PR #706.

I have now added, via CODEGEN, the option to enable FPEs in the check.exe executable for all processes. I have rerun my usual 78 tests with this enabled. I find 8 failures out of these, for ggttg (float/mixed) and gqttq (float):

tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd1.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd1.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:Floating Point Exception (CPU)

The fact that this also happens in mixed mode (double precision for the Feynman diagrams, single precision for the color algebra) makes me suspect that, for ggttg, this is in the color algebra?

On the other hand for gqttq this is maybe in the Feynman diagrams, as it only happens in f and not in m?

Note: the madevent 'tmad' tests for these processes conversely have run ok for me so far. Maybe just because I have not tried enough events? By default FPEs are normally enabled in madevent (by linking fortran and c++). Here the difference is that I have explicitly enabled FPEs in the c++-only check.exe executables.
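As an illustration of what "explicitly enabling FPEs" in a C++-only executable can look like, here is a minimal sketch, not the actual check.exe code. It assumes glibc (where `feenableexcept` is available as a GNU extension) and assumes that any non-empty value of the `CUDACPP_RUNTIME_ENABLEFPE` environment variable requests the traps; the helper name `enableFpeTrapsIfRequested` is hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <fenv.h> // feenableexcept is a glibc extension declared here
#include <string>

// Hypothetical helper (not from the madgraph4gpu sources): turn the usually
// silent IEEE 754 exceptions into SIGFPE if the given environment variable
// is set to a non-empty value. Returns true if traps were enabled.
inline bool enableFpeTrapsIfRequested( const char* envVarName = "CUDACPP_RUNTIME_ENABLEFPE" )
{
  assert( envVarName != nullptr );
  const char* value = std::getenv( envVarName );
  // ASSUMPTION: unset or empty means "leave the default (no traps)"
  if( value == nullptr || std::string( value ).empty() ) return false;
#if defined( __GLIBC__ )
  // Trap on division by zero, invalid operations (e.g. 0/0) and overflow
  return feenableexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW ) != -1;
#else
  return false; // feenableexcept is not portable: do nothing elsewhere
#endif
}
```

Calling such a helper once at the start of `main` would make a later 0/0 abort the program with SIGFPE instead of silently producing a NaN.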


…ble - some failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd1.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd1.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:Floating Point Exception (CPU)

STARTED AT Sun Oct 29 10:38:58 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sun Oct 29 11:10:35 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Sun Oct 29 11:22:39 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Sun Oct 29 11:31:39 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Sun Oct 29 11:34:48 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Sun Oct 29 11:37:55 PM CET 2023 [Status=0]

valassi · 2023-10-30T10:01:10Z

I am pinning this issue as high priority. There are FPEs in check.exe for ggttg and gqttq.

I have instead closed #733 for FPEs in madevent in nobm_pp_ttW.

Note however that there are ALSO still FPEs in check.exe in nobm_pp_ttW. I will follow these up here and also in PR #706.

…ble - some failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

STARTED AT Mon Oct 30 10:32:51 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Mon Oct 30 10:56:58 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Mon Oct 30 11:06:26 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Mon Oct 30 11:15:35 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Mon Oct 30 11:18:48 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Mon Oct 30 11:22:00 PM CET 2023 [Status=0]

…ble - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance (*NB OpenMP is now disabled by default!*)

STARTED AT Fri Nov 3 10:06:44 AM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Fri Nov 3 01:30:11 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Fri Nov 3 01:55:47 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Fri Nov 3 02:05:25 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Fri Nov 3 02:08:40 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Fri Nov 3 02:11:53 PM CET 2023 [Status=0]

…ble - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance (*NB OpenMP is now again enabled by default*) (or maybe ~1-2% slower on average? anyway, keep OpenMP on as in the past)

valassi · 2023-11-08T15:39:58Z

tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)

I have checked this again. Note that these FPEs only happen in very special corners now. For instance the one above gives

cd epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux

make cleanall; AVX=512y FPTYPE=f make -j
./check.exe  -p 1 16 1
[all OK!]

make cleanall; AVX=512z FPTYPE=f make -j
CUDACPP_RUNTIME_ENABLEFPE=1 ./check.exe  -p 1 16 1
[crashes!]

So this one crashes only for float (not for double or mixed) and only for 512z (not for 512y).

valassi · 2023-11-08T16:15:52Z

Hi @roiser I added the comment above for you.

Via email I got a notification that you had sent a comment here asking if this was fixed, but I do not see it in the web interface. (Did you delete it, or was there a problem in the email-GitHub interface?)

Anyway. The issue is still pending at least in some cases. I guess it will need to be debugged one by one.

I will try to add this to the testsuite in my PR #792

…d for push/manual, disabled for PRs) Note: the FPE crashes in madgraph5#783 are not shown here because they need FPTYPE=f builds. I will add those in a more complex workflow with one codegen job and several build/test jobs.

valassi · 2023-11-08T17:57:36Z

I have implemented this in the extended CI, but the issue does not show because the github default nodes do not have AVX512...
https://github.com/valassi/madgraph4gpu/actions/runs/6802025406/job/18494211030


Execute build.avx2_f_inl0_hrd0/gcheck.exe -p 1 32 1
(SKIP missing build.avx2_f_inl0_hrd0/gcheck.exe)

(SKIP 512y which is not supported - no avx512vl in /proc/cpuinfo)

(SKIP 512z which is not supported - no avx512vl in /proc/cpuinfo)

We will need to run these tests on our own nodes with GPU and AVX512
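The "no avx512vl in /proc/cpuinfo" skip above amounts to looking for a flag token in the CPU description. A minimal sketch of such a check is given below; the function name `hasCpuFlag` is hypothetical and this is not the logic of the actual CI scripts, it only assumes the usual Linux layout where a line starting with `flags` lists space-separated feature names.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: return true if 'flag' appears as a whole token on a
// "flags ..." line of a /proc/cpuinfo-like stream (Linux x86 layout assumed).
inline bool hasCpuFlag( std::istream& cpuinfo, const std::string& flag )
{
  assert( !flag.empty() );
  std::string line;
  while( std::getline( cpuinfo, line ) )
  {
    if( line.rfind( "flags", 0 ) != 0 ) continue; // keep only lines starting with "flags"
    std::istringstream tokens( line );
    std::string token;
    while( tokens >> token )
      if( token == flag ) return true; // exact match: "avx512vlx" would not count
  }
  return false;
}
```

On a real node this could be called with a `std::ifstream` opened on `/proc/cpuinfo` and the flag `"avx512vl"` to decide whether the 512y and 512z tests can run.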

… list of physics processes (test madgraph5#783?)

…le - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

… 3.5.2 - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

STARTED AT Thu Nov 9 05:26:21 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Thu Nov 9 05:54:46 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Thu Nov 9 06:05:38 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Thu Nov 9 06:15:05 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Thu Nov 9 06:18:20 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Thu Nov 9 06:21:32 PM CET 2023 [Status=0]

… color choice if channelId==0 (this fixes the FPE madgraph5#783!)

Before this change, I had identified the source of the FPE with gdb:

    cd nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu
    make -f cudacpp.mk debug -j
    CUDACPP_RUNTIME_ENABLEFPE=on gdb --args ./check.exe -p 1 8 1
    (gdb) run
    Starting program: /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/check.exe -p 1 8 1
    ...
    Program received signal SIGFPE, Arithmetic exception.
    mg5amcCpu::_ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i._omp_fn.0(void) () at CPPProcess.cc:1250
    1250              const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
    Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
    (gdb) p channelIdC
    $1 = -1

…adgraph5#783 (skip evt-by-evt color choice if channelId==0)

I checked that this indeed fixes tput tests for ggttg ./tput/teeThroughputX.sh -ggttg -fltonly -makeclean -makej

valassi · 2023-11-24T12:45:12Z

I believe I have found the source and a fix for (at least part of?) the floating point exceptions here. The fix is committed as part of PR #706.

I have debugged this as follows

    cd nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu
    make -f cudacpp.mk debug -j
    CUDACPP_RUNTIME_ENABLEFPE=on gdb --args ./check.exe -p 1 8 1
    (gdb) run
    Starting program: /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/check.exe -p 1 8 1
    ...
    Program received signal SIGFPE, Arithmetic exception.
    mg5amcCpu::_ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i._omp_fn.0(void) () at CPPProcess.cc:1250
    1250              const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
    Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
    (gdb) p channelIdC
    $1 = -1

Essentially, the problem is that line 1250 performs a 0/0 division because it should not be executed at all in the first place! This line belongs to the event-by-event color choice, which should be skipped when channelId==0, but was not. (This matches the channelIdC = -1 printed by gdb, presumably the zero-based counterpart of channelId.) The fix simply consists in skipping the event-by-event color choice if channelId==0. I checked that this does bypass and fix the FPEs at least for a few cases like nobm_pp_ttW and gg_ttg. I am now rerunning more complete coverage tests.
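The idea of the fix can be sketched in scalar form as follows. This is a simplified illustration, not the generated CPPProcess.cc code: the function `selectColor`, its signature and the use of a `std::vector` of cumulative amplitudes are assumptions made for the example (the real code is vectorized and indexes `targetamp[icolC][ieppV]`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar sketch of the event-by-event color choice.
// 'targetamp' holds CUMULATIVE per-color weights, so its last entry is the total.
// Returns the selected color (1-based), or 0 if the color choice is bypassed.
inline int selectColor( unsigned int channelId, double rndcol, const std::vector<double>& targetamp )
{
  // The guard: with channelId==0 the weights are never filled (all zero), so
  // the ratio below would be 0/0. Skip the color choice entirely.
  if( channelId == 0 || targetamp.empty() ) return 0;
  const double total = targetamp.back();
  for( std::size_t icol = 0; icol < targetamp.size(); icol++ )
  {
    // Same comparison as in the line flagged by gdb, for a single event
    const bool okcol = rndcol < ( targetamp[icol] / total );
    if( okcol ) return static_cast<int>( icol ) + 1;
  }
  return static_cast<int>( targetamp.size() ); // fallback for rndcol at the upper edge
}
```

With the guard in place, a call with channelId==0 returns before any division is attempted, whatever the contents of the weights.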

(PS I have no idea why FPEs happen only for some combinations of processes, floating-point precision and AVX architecture... I guess that those that do not fail are just "lucky".)
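As a side illustration of the 0/0 discussed in this comment: without traps enabled, such a division does not crash, it yields a NaN and raises the IEEE 754 "invalid" status flag. The sketch below only demonstrates that generic behaviour and does not explain why some builds fail and others do not; the function name is hypothetical, and the `volatile` qualifiers are there to keep the compiler from folding the division at compile time.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical demo: perform 0/0 at runtime and report whether the
// FE_INVALID status flag was raised (traps are assumed to be disabled).
inline bool zeroOverZeroRaisesInvalid()
{
  std::feclearexcept( FE_ALL_EXCEPT );
  volatile float num = 0.f;
  volatile float den = 0.f;
  volatile float ratio = num / den; // evaluated at runtime, gives NaN
  const bool invalidRaised = std::fetestexcept( FE_INVALID ) != 0;
  const float r = ratio;
  assert( r != r ); // NaN is the only value that differs from itself
  return invalidRaised;
}
```

On typical IEEE 754 hardware this is expected to return true; with traps enabled (as in the check.exe tests here), the same division would deliver SIGFPE instead.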

valassi · 2023-11-24T13:20:43Z

Note, this sounds very much related to what @nscottnichols reported in #611. There is a channelIdC that becomes -1, and this line should not be executed.

valassi · 2023-11-24T13:21:50Z

Note also that bypassing the color choice is needed in cases such as reweighting #607

…ere are no FPEs any more!)

STARTED AT Fri Nov 24 12:07:57 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Fri Nov 24 02:47:13 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Fri Nov 24 03:06:19 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Fri Nov 24 03:16:14 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Fri Nov 24 03:19:35 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Fri Nov 24 03:22:53 PM CET 2023 [Status=0]

There used to be eight "Floating Point Exception (CPU)" errors in the logs, they have now all disappeared

NB1: the code builds ok for HRDCOD=0 (so madgraph5#695 does NOT affect this!), and runTest is ok in all P* (the ref files are there) NB2: a full tlau test is now also ok on this process, showing that all issues madgraph5#701 madgraph5#733 madgraph5#783 etc have been fixed tlau/lauX.sh -CPP nobm_pp_ttW.mad ... Cross-section : 1.276 +- 0.007916 pb In summary: this process should now be fully usable by ATLAS and other experiments.

This was referenced Oct 30, 2023

fix FPEs and debug nobm_pp_ttW for ATLAS #706

Merged

Three floating point exceptions in CPP launch of nobm_pp_ttW (FPE in COUP values in VVV1P0_1) #733

Closed

valassi pinned this issue Oct 30, 2023

valassi changed the title ~~Floating point exceptions in check.exe for ggttg (float/mixed) and gqttq~~ Floating point exceptions in check.exe for ggttg (float/mixed) and gqttq - and also in nobm_pp_ttW Oct 30, 2023

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 8, 2023

[actions/nobm] in .github/workflows/testsuite, add nobm_pp_ttW to the…

db6aa2c

… list of physics processes (test madgraph5#783?)

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 8, 2023

[actions/nobm] in .github/workflows/testsuite, add nobm_pp_ttW to the…

bc855bc

… list of physics processes (test madgraph5#783?)

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 9, 2023

[gpucpp] rerun 78 tput tests, with FPEs enabled in the check executab…

a4f7487

…le - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 10, 2023

[actions/nobm] in .github/workflows/testsuite, add nobm_pp_ttW to the…

ac27efd

… list of physics processes (test madgraph5#783?)

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 24, 2023

[nobm] backport to codegen from nobm_pp_ttW.mad the bug fix for FPE m…

6ac6cf9

…adgraph5#783 (skip evt-by-evt color choice if channelId==0)

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 24, 2023

[nobm] regenerate ggttg.mad with fix for FPE madgraph5#783

bfda86b

I checked that this indeed fixes tput tests for ggttg ./tput/teeThroughputX.sh -ggttg -fltonly -makeclean -makej

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 24, 2023

[nobm] regenerate all processes with the fix for FPE madgraph5#783

13efd8d

valassi self-assigned this Nov 24, 2023

valassi linked a pull request Nov 24, 2023 that will close this issue

fix FPEs and debug nobm_pp_ttW for ATLAS #706

Merged

valassi changed the title ~~Floating point exceptions in check.exe for ggttg (float/mixed) and gqttq - and also in nobm_pp_ttW~~ Bypass color choice for channelId==0 (floating point exceptions in check.exe for ggttg, gqttq, nobm_pp_ttW) Nov 24, 2023

valassi mentioned this issue Nov 24, 2023

Out-of-bounds memory access in random color choice #611

Closed

This was referenced Nov 24, 2023

Add the option to bypass the random choice of helicity and color (eg for reweighting) #607

Open

extend testsuite CI (split codegen from build/test, execute tests for many fptypes, add tmad tests) #794

Merged

valassi closed this as completed in #706 Dec 16, 2023

valassi unpinned this issue Feb 2, 2024

valassi mentioned this issue Jul 2, 2024

Further fixes/improvements for iconfig-channel mappings in coloramps.h #877

Merged
