Launcher and worker statuses do not correctly indicate the underlying states #90

@terrytangyuan

Launcher keeps crashing:

mpi-test-2-mpijob-launcher-lv2fx   1/1   CrashLoopBackOff   2   1m
mpi-test-2-mpijob-worker-0         1/1   Running            0   1m
mpi-test-2-mpijob-worker-1         1/1   Running            0   1m

However, the launcher's log shows that one of the workers is what actually failed: it was killed (later found to be due to OOM), even though both workers are reported as Running.

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real noticed that process rank 1 with PID 39 on node mpi-test-2-mpijob-worker-1 exited on signal 9 (Killed).
--------------------------------------------------------------------------

Here's the description of the launcher job, which does not indicate any abnormal events:

Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  14m   job-controller  Created pod: mpi-test-mpijob-launcher-m8kw6

The above problems could potentially be addressed by #12 (mpirun currently does not give us helpful error messages, so PMIx may be a better option here) and #54 (currently only the launcher pod is shown as failing, while it is actually the workers that are failing). There are other solutions too, but I just wanted to link to the existing related issues.
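
In the meantime, the failing worker can at least be recovered from the launcher log itself. The snippet below is only a rough sketch of that idea, not something the operator does today: it assumes the mpirun message keeps the exact wording shown in the log above, and `RankFailure` and `find_rank_failure` are hypothetical names made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class RankFailure(NamedTuple):
    """Details of the first failed rank reported by mpirun (hypothetical type)."""
    rank: int
    pid: int
    node: str
    signal: int
    signal_name: str


# Assumption: the wording matches the launcher log quoted in this issue.
_FAILURE_RE = re.compile(
    r"noticed that process rank (?P<rank>\d+) with PID (?P<pid>\d+) "
    r"on node (?P<node>\S+) exited on signal (?P<signal>\d+) "
    r"\((?P<name>[^)]+)\)"
)


def find_rank_failure(log_text: str) -> Optional[RankFailure]:
    """Return the first rank failure found in the launcher log, or None."""
    match = _FAILURE_RE.search(log_text)
    if match is None:
        return None
    return RankFailure(
        rank=int(match["rank"]),
        pid=int(match["pid"]),
        node=match["node"],
        signal=int(match["signal"]),
        signal_name=match["name"],
    )


sample = (
    "mpirun.real noticed that process rank 1 with PID 39 on node "
    "mpi-test-2-mpijob-worker-1 exited on signal 9 (Killed)."
)
failure = find_rank_failure(sample)
if failure is not None:
    print(f"worker {failure.node} (rank {failure.rank}) died on signal {failure.signal}")
```

Note that signal 9 only says the process was killed; it does not by itself prove an OOM, so the worker's node or container events would still need to be checked to confirm the cause.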
