Bypass color choice for channelId==0 (floating point exceptions in check.exe for ggttg, gqttq, nobm_pp_ttW) #783

Closed
valassi opened this issue Oct 30, 2023 · 7 comments · Fixed by #706
Assignees

Comments

@valassi
Member

valassi commented Oct 30, 2023

This is a followup of #733 and PR #706.

I have now added to all processes via CODEGEN the option to enable FPEs in the check.exe executable. I have rerun my usual 78 tests with this option. I find 8 failures among them, for ggttg (float/mixed) and gqttq:

tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd1.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd1.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:Floating Point Exception (CPU)

The fact that this happens for the mixed mode makes me suspect that this is in the color algebra, for ggttg?

On the other hand for gqttq this is maybe in the Feynman diagrams, as it only happens in f and not in m?

Note: the madevent 'tmad' tests for these processes conversely have run ok for me so far. Maybe just because I have not tried enough events? By default FPEs are normally enabled in madevent (by linking fortran and c++). Here the difference is that I have explicitly enabled FPEs in the c++-only check.exe executables.
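For readers unfamiliar with how FPEs can be turned into crashes in a C++-only executable, here is a minimal sketch. It is not the code used in check.exe: the helper names `fpeRequested`, `enableFpeTraps` and `enableFpeTrapsIfRequested` are hypothetical, and treating any non-empty value of `CUDACPP_RUNTIME_ENABLEFPE` as "on" is an assumption. It relies on `feenableexcept`, a GNU extension available with glibc on Linux.

```cpp
#include <cfenv>   // FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
#include <cstdlib> // std::getenv
#include <fenv.h>  // feenableexcept (GNU extension, glibc only)

// Hypothetical helper: true if the environment variable value is set and non-empty.
// (Assumption: the real parsing of CUDACPP_RUNTIME_ENABLEFPE may differ.)
inline bool fpeRequested( const char* envValue )
{
  return envValue != nullptr && envValue[0] != '\0';
}

// Hypothetical helper: make division by zero, invalid operations (e.g. 0/0)
// and overflows deliver SIGFPE instead of silently setting a status flag.
// Returns true on success (feenableexcept returns -1 on failure).
inline bool enableFpeTraps()
{
  const int traps = FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW;
  return feenableexcept( traps ) != -1;
}

// Hypothetical helper: enable the traps only if the user asked for them.
inline bool enableFpeTrapsIfRequested()
{
  return fpeRequested( std::getenv( "CUDACPP_RUNTIME_ENABLEFPE" ) ) && enableFpeTraps();
}
```

Calling `enableFpeTrapsIfRequested()` at the start of `main()` would then make any later 0/0 abort the program with "Floating point exception" when the variable is set.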

valassi added a commit to valassi/madgraph4gpu that referenced this issue Oct 30, 2023
…ble - some failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd1.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd1.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0_bridge.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)
tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd1.txt:Floating Point Exception (CPU)

STARTED  AT Sun Oct 29 10:38:58 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Sun Oct 29 11:10:35 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Sun Oct 29 11:22:39 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Sun Oct 29 11:31:39 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Sun Oct 29 11:34:48 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Sun Oct 29 11:37:55 PM CET 2023 [Status=0]
@valassi valassi pinned this issue Oct 30, 2023
@valassi
Member Author

valassi commented Oct 30, 2023

I am pinning this issue as high priority. There are FPEs in check.exe for ggttg and gqttq.

I have instead closed #733 for FPEs in madevent in nobm_pp_ttW.

Note however that there are ALSO still FPEs in check.exe in nobm_pp_ttW. I will follow these up here and also in PR #706.

@valassi valassi changed the title Floating point exceptions in check.exe for ggttg (float/mixed) and gqttq Floating point exceptions in check.exe for ggttg (float/mixed) and gqttq - and also in nobm_pp_ttW Oct 30, 2023
valassi added a commit to valassi/madgraph4gpu that referenced this issue Oct 31, 2023
…ble - some failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

STARTED  AT Mon Oct 30 10:32:51 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Mon Oct 30 10:56:58 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Mon Oct 30 11:06:26 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Mon Oct 30 11:15:35 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Mon Oct 30 11:18:48 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Mon Oct 30 11:22:00 PM CET 2023 [Status=0]
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 3, 2023
…ble - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance (*NB OpenMP is now disabled by default!*)

STARTED  AT Fri Nov  3 10:06:44 AM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Fri Nov  3 01:30:11 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Fri Nov  3 01:55:47 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Fri Nov  3 02:05:25 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Fri Nov  3 02:08:40 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Fri Nov  3 02:11:53 PM CET 2023 [Status=0]
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 3, 2023
…ble - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance (*NB OpenMP is now again enabled by default*)

(or maybe ~1-2% slower on average? anyway, keep OpenMP on as in the past)
@valassi
Member Author

valassi commented Nov 8, 2023

tput/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt:Floating Point Exception (CPU)

I have checked this again. Note that these FPEs only happen in very special corners now. For instance the one above gives

cd epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux> 

make cleanall; AVX=512y FPTYPE=f make -j
./check.exe  -p 1 16 1
[all OK!]

make cleanall; AVX=512z FPTYPE=f make -j
CUDACPP_RUNTIME_ENABLEFPE=1 ./check.exe  -p 1 16 1
[crashes!]

So: this one only crashes for float (not double and not mixed) and only for 512z (not for 512y)

@valassi
Member Author

valassi commented Nov 8, 2023

Hi @roiser I added the comment above for you.

Via email I got a notification that you had sent a comment here asking if this was fixed, but I do not see it in the web interface. (Did you delete it, or was there a problem in the email-GitHub interface?)

Anyway. The issue is still pending at least in some cases. I guess it will need to be debugged one by one.

I will try to add this to the testsuite in my PR #792

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 8, 2023
…d for push/manual, disabled for PRs)

Note: the FPE crashes in madgraph5#783 are not shown here because they need FPTYPE=f builds.
I will add those in a more complex workflow with one codegen job and several build/test jobs.
@valassi
Member Author

valassi commented Nov 8, 2023

I have implemented this in the extended CI, but the issue does not show because the github default nodes do not have AVX512...
https://github.com/valassi/madgraph4gpu/actions/runs/6802025406/job/18494211030


Execute build.avx2_f_inl0_hrd0/gcheck.exe -p 1 32 1
(SKIP missing build.avx2_f_inl0_hrd0/gcheck.exe)

(SKIP 512y which is not supported - no avx512vl in /proc/cpuinfo)

(SKIP 512z which is not supported - no avx512vl in /proc/cpuinfo)

We will need to run these tests on our own nodes with GPU and AVX512
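The "no avx512vl in /proc/cpuinfo" skip above can be illustrated with a small sketch. This is not the test script's actual logic: `hasCpuFlag` is a hypothetical helper, written in C++ only for consistency with the rest of the code base, and it assumes the Linux `/proc/cpuinfo` layout where CPU features are listed on lines starting with `flags`.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: true if `flag` appears as a whole word on a "flags" line
// of /proc/cpuinfo-style text (passed in as a string so it is easy to test).
inline bool hasCpuFlag( const std::string& cpuinfoText, const std::string& flag )
{
  std::istringstream lines( cpuinfoText );
  std::string line;
  while( std::getline( lines, line ) )
  {
    if( line.compare( 0, 5, "flags" ) != 0 ) continue; // only "flags : ..." lines
    const std::size_t colon = line.find( ':' );
    if( colon == std::string::npos ) continue;
    std::istringstream words( line.substr( colon + 1 ) );
    std::string word;
    while( words >> word )
      if( word == flag ) return true; // whole-word match, so "avx512vlx" does not count
  }
  return false;
}
```

On a real node the text would come from reading `/proc/cpuinfo` (for instance with `std::ifstream`), and a build such as 512y or 512z would be skipped when `hasCpuFlag( text, "avx512vl" )` is false.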

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 8, 2023
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 8, 2023
…d for push/manual, disabled for PRs)

Note: the FPE crashes in madgraph5#783 are not shown here because they need FPTYPE=f builds.
I will add those in a more complex workflow with one codegen job and several build/test jobs.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 8, 2023
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 9, 2023
…le - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 10, 2023
… 3.5.2 - usual failures in ggttg f/m and gqttq f (madgraph5#783), no change in performance

STARTED  AT Thu Nov  9 05:26:21 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Thu Nov  9 05:54:46 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Thu Nov  9 06:05:38 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Thu Nov  9 06:15:05 PM CET 2023 [Status=2]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Thu Nov  9 06:18:20 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Thu Nov  9 06:21:32 PM CET 2023 [Status=0]
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 10, 2023
…d for push/manual, disabled for PRs)

Note: the FPE crashes in madgraph5#783 are not shown here because they need FPTYPE=f builds.
I will add those in a more complex workflow with one codegen job and several build/test jobs.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 10, 2023
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 24, 2023
… color choice if channelId==0 (this fixes the FPE madgraph5#783!)

Before this change, I had identified the source of the FPE with gdb:

cd nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu
make -f cudacpp.mk debug -j
CUDACPP_RUNTIME_ENABLEFPE=on gdb --args ./check.exe -p 1 8 1
(gdb) run
Starting program: /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/check.exe -p 1 8 1
...
Program received signal SIGFPE, Arithmetic exception.
mg5amcCpu::_ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i._omp_fn.0(void) () at CPPProcess.cc:1250
1250              const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
(gdb) p channelIdC
$1 = -1
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 24, 2023
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 24, 2023
I checked that this indeed fixes tput tests for ggttg
./tput/teeThroughputX.sh -ggttg -fltonly -makeclean -makej
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 24, 2023
@valassi valassi self-assigned this Nov 24, 2023
@valassi valassi linked a pull request Nov 24, 2023 that will close this issue
@valassi
Member Author

valassi commented Nov 24, 2023

I believe I have found the source and a fix for (at least part of?) the floating point exceptions here. The fix is committed as part of PR #706.

I have debugged this as follows

    cd nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu
    make -f cudacpp.mk debug -j
    CUDACPP_RUNTIME_ENABLEFPE=on gdb --args ./check.exe -p 1 8 1
    (gdb) run
    Starting program: /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu/check.exe -p 1 8 1
    ...
    Program received signal SIGFPE, Arithmetic exception.
    mg5amcCpu::_ZN9mg5amcCpu8sigmaKinEPKdS1_S1_S1_PdjS2_S2_PiS3_i._omp_fn.0(void) () at CPPProcess.cc:1250
    1250              const bool okcol = allrndcol[ievt] < ( targetamp[icolC][ieppV] / targetamp[ncolor - 1][ieppV] );
    Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64
    (gdb) p channelIdC
    $1 = -1

Essentially, the problem is that line 1250 computes a 0/0 division, and it should not be executed at all in this case. That line belongs to the event-by-event color choice, which should be skipped when channelId==0, but was not. The fix simply consists in skipping the event-by-event color choice if channelId==0. I checked that this does bypass and fix the FPEs at least for a few cases like nobm_pp_ttW and gg_ttg. I am now rerunning more complete coverage tests.
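The idea of the fix can be sketched as follows. This is a simplified illustration and not the generated CPPProcess.cc code: `chooseColor`, `raisesInvalid` and the `bypassIfNoChannel` switch are hypothetical names, the weights are a plain vector rather than SIMD vectors, and the `volatile` qualifiers are only there to keep the division at run time in this toy example.

```cpp
#include <cfenv>
#include <cstddef>
#include <vector>

// Hypothetical, simplified event-by-event color choice.
// targetamp holds cumulative per-color weights for one event.
// Returns the 1-based index of the chosen color, or 0 if the choice is bypassed.
inline int chooseColor( unsigned int channelId, const std::vector<float>& targetamp, float rndcol, bool bypassIfNoChannel )
{
  if( bypassIfNoChannel && channelId == 0 ) return 0; // the guard: never reach the division
  const volatile float total = targetamp.back();      // all-zero weights give total == 0
  for( std::size_t icol = 0; icol < targetamp.size(); icol++ )
  {
    const volatile float num = targetamp[icol];
    if( rndcol < num / total ) return static_cast<int>( icol ) + 1; // 0/0 raises FE_INVALID
  }
  return static_cast<int>( targetamp.size() );
}

// Returns true if the color choice raised the FE_INVALID status flag,
// which is what becomes a SIGFPE once FPE traps are enabled.
inline bool raisesInvalid( bool bypassIfNoChannel )
{
  const std::vector<float> zeroWeights( 4, 0.f ); // weights as in the 0/0 case
  std::feclearexcept( FE_ALL_EXCEPT );
  chooseColor( 0, zeroWeights, 0.5f, bypassIfNoChannel );
  return std::fetestexcept( FE_INVALID ) != 0;
}
```

In this sketch `raisesInvalid( false )` reports the invalid operation while `raisesInvalid( true )` does not, because with the guard the division is never evaluated for channelId==0.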

(PS I have no idea why FPEs happen only for some combinations of processes, fp precision and AVX architecture... I guess that those that do not fail are "lucky").

@valassi valassi changed the title Floating point exceptions in check.exe for ggttg (float/mixed) and gqttq - and also in nobm_pp_ttW Bypass color choice for channelId==0 (floating point exceptions in check.exe for ggttg, gqttq, nobm_pp_ttW) Nov 24, 2023
@valassi
Member Author

valassi commented Nov 24, 2023

Note, this sounds very much related to what @nscottnichols reported in #611. There is a channelIdC that becomes -1 and this line should not be executed.

@valassi
Member Author

valassi commented Nov 24, 2023

Note also that bypassing the color choice is needed in cases such as reweighting #607

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 25, 2023
…ere are no FPEs any more!)

STARTED  AT Fri Nov 24 12:07:57 PM CET 2023
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean
ENDED(1) AT Fri Nov 24 02:47:13 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean
ENDED(2) AT Fri Nov 24 03:06:19 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean
ENDED(3) AT Fri Nov 24 03:16:14 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst
ENDED(4) AT Fri Nov 24 03:19:35 PM CET 2023 [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -curhst
ENDED(5) AT Fri Nov 24 03:22:53 PM CET 2023 [Status=0]

There used to be eight "Floating Point Exception (CPU)" errors in the logs; they have now all disappeared.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 25, 2023
NB1: the code builds ok for HRDCOD=0 (so madgraph5#695 does NOT affect this!), and runTest is ok in all P* (the ref files are there)

NB2: a full tlau test is now also ok on this process, showing that all issues madgraph5#701 madgraph5#733 madgraph5#783 etc have been fixed

tlau/lauX.sh -CPP nobm_pp_ttW.mad
...
     Cross-section :   1.276 +- 0.007916 pb

In summary: this process should now be fully usable by ATLAS and other experiments.
@valassi valassi unpinned this issue Feb 2, 2024