Launcher keeps crashing:
mpi-test-2-mpijob-launcher-lv2fx 1/1 CrashLoopBackOff 2 1m
mpi-test-2-mpijob-worker-0 1/1 Running 0 1m
mpi-test-2-mpijob-worker-1 1/1 Running 0 1m
However, the launcher's log shows that one of the workers is actually the one failing: its process was killed (later found to be due to OOM).
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real noticed that process rank 1 with PID 39 on node mpi-test-2-mpijob-worker-1 exited on signal 9 (Killed).
--------------------------------------------------------------------------
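Until the operator surfaces this itself, the failing worker can be pulled out of the launcher log mechanically. The sketch below is only an illustration: it assumes the message format shown in the log above, and `MpiFailure` and `find_signalled_rank` are hypothetical names, not part of mpi-operator or Open MPI. Note that signal 9 is consistent with an OOM kill but does not prove one, so the result only tells you which worker to inspect.

```python
import re
from typing import NamedTuple, Optional


class MpiFailure(NamedTuple):
    """Details of a rank that mpirun reported as killed by a signal."""
    rank: int
    pid: int
    node: str
    signal: int
    signal_name: str


# Matches the "noticed that process rank ... exited on signal ..." line
# in the form it takes in the launcher log above (an assumption; other
# MPI versions may word it differently).
_PATTERN = re.compile(
    r"noticed that process rank (?P<rank>\d+) with PID (?P<pid>\d+) "
    r"on node (?P<node>\S+) exited on signal (?P<signal>\d+) "
    r"\((?P<name>[^)]+)\)"
)


def find_signalled_rank(log_text: str) -> Optional[MpiFailure]:
    """Return the first signalled rank found in the log, or None."""
    # Collapse whitespace so a message wrapped across lines still matches.
    normalized = " ".join(log_text.split())
    match = _PATTERN.search(normalized)
    if match is None:
        return None
    return MpiFailure(
        rank=int(match.group("rank")),
        pid=int(match.group("pid")),
        node=match.group("node"),
        signal=int(match.group("signal")),
        signal_name=match.group("name"),
    )
```

Feeding it the launcher log text would point at `mpi-test-2-mpijob-worker-1`, which is the pod whose memory limits and node events are worth checking next.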
Here is the description of the launcher job, which does not indicate any abnormal events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 14m job-controller Created pod: mpi-test-mpijob-launcher-m8kw6
The problems above could potentially be addressed by #12 (mpirun currently does not give helpful error messages, so PMIx may be a better fit here) and #54 (currently only the launcher pod is shown as failing, even though the workers are the ones actually failing). There are other possible solutions too; I just wanted to link the existing related issues.