Encounter NIL Error when job in error stage with TTL value set #170

Open
mirocody opened this issue Oct 26, 2021 · 0 comments
Comments


mirocody commented Oct 26, 2021

Hi community,
I am trying to deploy a simple task as a PyTorchJob with the following YAML:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorchjob
  namespace: abc
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
          - args:
            - |+
              echo "Hello World!"
              python -u exception.py 
            command:
            - /usr/bin/env
            - bash
            - -c
            env:
            - name: LOCAL_RANK
              value: '0'
            image: <centos>
            name: pytorch
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
          - args:
            - |+
              echo "Hello World!"
              python -u exception.py 
            command:
            - /usr/bin/env
            - bash
            - -c
            env:
            - name: LOCAL_RANK
              value: '0'
            image: <centos>
            name: pytorch

  runPolicy:
    ttlSecondsAfterFinished: 864000

The script exception.py does nothing but throw an exception, so the container goes into an error state. The training-operator pod then logs the following:

E1026 03:50:23.343541       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 560 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x16da180, 0x27a0b00)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/runtime/runtime.go:48 +0x82
panic(0x16da180, 0x27a0b00)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).CleanupJob(0xc000e89320, 0xc000703618, 0xc000f19600, 0x3, 0x3, 0xc000818720, 0x0, 0x0, 0x0, 0x18987c0, ...)
        /go/pkg/mod/github.com/kubeflow/common@v0.3.7/pkg/controller.v1/common/job.go:401 +0xbd
github.com/kubeflow/common/pkg/controller.v1/common.(*JobController).ReconcileJobs(0xc000e89320, 0x18987c0, 0xc000703500, 0xc0008183c0, 0xc000f19600, 0x3, 0x3, 0xc000818720, 0x0, 0x0, ...)
        /go/pkg/mod/github.com/kubeflow/common@v0.3.7/pkg/controller.v1/common/job.go:147 +0x76d
github.com/kubeflow/tf-operator/pkg/controller.v1/pytorch.(*PyTorchJobReconciler).Reconcile(0xc000e89320, 0x1b88fa0, 0xc000818270, 0xc000624f60, 0x13, 0xc000a1b590, 0x28, 0xc000818270, 0x40903b, 0xc000030000, ...)
        /workspace/pkg/controller.v1/pytorch/pytorchjob_controller.go:159 +0x83c
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000743ea0, 0x1b88ee0, 0xc000d26400, 0x1750a40, 0xc000348340)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:263 +0x2f1
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000743ea0, 0x1b88ee0, 0xc000d26400, 0x0)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:235 +0x202
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1(0x1b88ee0, 0xc000d26400)
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:198 +0x4a
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0x37
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc00026c750)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001121f50, 0x1b46440, 0xc000818180, 0xc000d26401, 0xc000a36240)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:156 +0xa3
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00026c750, 0x3b9aca00, 0x0, 0x1, 0xc000a36240)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext(0x1b88ee0, 0xc000d26400, 0xc000c0eb10, 0x3b9aca00, 0x0, 0x1986d01)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:185 +0xa6
k8s.io/apimachinery/pkg/util/wait.UntilWithContext(0x1b88ee0, 0xc000d26400, 0xc000c0eb10, 0x3b9aca00)
        /go/pkg/mod/k8s.io/apimachinery@v0.19.9/pkg/util/wait/wait.go:99 +0x57
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.2/pkg/internal/controller/controller.go:195 +0x4f6
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x14a257d]

It looks like this line assumes that the job's completion time is already set when cleanup starts; for a job that ends in an error state the completion time is still nil at that point, which would explain the nil pointer dereference in CleanupJob.
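For illustration, here is a minimal, self-contained Go sketch of that assumption (not the actual kubeflow/common code; the `jobStatus` struct, the `cleanupExpired` helper, and the plain `*time.Time` in place of `metav1.Time` are stand-ins). Dereferencing a nil completion time reproduces the same class of panic, and a nil check before computing the TTL expiry avoids it:

```go
package main

import (
	"fmt"
	"time"
)

// jobStatus stands in for the real job status type: CompletionTime is a
// pointer and stays nil when a job ends in an error state without ever
// being marked as completed.
type jobStatus struct {
	CompletionTime *time.Time
}

// cleanupExpired sketches the TTL cleanup decision: the job is eligible for
// deletion once CompletionTime plus ttlSecondsAfterFinished has passed.
// Checking for a nil CompletionTime up front avoids the nil pointer
// dereference seen in the panic above.
func cleanupExpired(status jobStatus, ttlSeconds int64, now time.Time) (bool, error) {
	if status.CompletionTime == nil {
		// Without this guard, status.CompletionTime.Add(...) dereferences a
		// nil pointer, which is the crash reported in CleanupJob.
		return false, fmt.Errorf("completion time is nil, cannot compute TTL expiry")
	}
	expiry := status.CompletionTime.Add(time.Duration(ttlSeconds) * time.Second)
	return now.After(expiry), nil
}

func main() {
	// A failed job: CompletionTime was never set.
	failed := jobStatus{}
	if _, err := cleanupExpired(failed, 864000, time.Now()); err != nil {
		fmt.Println("skipping cleanup:", err) // handled error instead of a panic
	}

	// A job that finished 11 days ago with a 10-day TTL: eligible for cleanup.
	done := time.Now().Add(-11 * 24 * time.Hour)
	succeeded := jobStatus{CompletionTime: &done}
	expired, _ := cleanupExpired(succeeded, 864000, time.Now())
	fmt.Println("expired:", expired)
}
```

With a guard like this, a job that went into an error state without a completion time would be skipped (or handled explicitly) instead of crashing the controller.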
