
[question] volcano pod is always pending after created #243

Closed
xiechengsheng opened this issue Aug 13, 2019 · 5 comments · Fixed by #245

Comments

@xiechengsheng
Contributor

Hi there, when I submit a volcano job according to the volcano job docs, the status of the created pods is Pending, and I have no idea how to deal with this problem.
Steps to reproduce:

1. Create the volcano operator and check its deployments:
$ kubectl get deployment --all-namespaces | grep volcano
default        volcano-release-admission         1         1         1            1           8d
default        volcano-release-controllers       1         1         1            1           8d
default        volcano-release-scheduler         1         1         1            1           8d

2. Submit the job:
$ arena submit volcanojob --name=demo
configmap/demo-volcanojob created
configmap/demo-volcanojob labeled
job.batch.volcano.sh/demo created
INFO[0001] The Job demo has been submitted successfully
INFO[0001] You can run `arena get demo --type volcanojob` to check the job status

3. Get the job status:
$ arena get --type volcanojob demo
STATUS: SUBMITTED
NAMESPACE: default
PRIORITY: N/A
TRAINING DURATION: 1m

NAME  STATUS   TRAINER     AGE  INSTANCE       NODE
demo  PENDING  VOLCANOJOB  1m   demo-task-0-0  N/A
demo  PENDING  VOLCANOJOB  1m   demo-task-1-0  N/A
demo  PENDING  VOLCANOJOB  1m   demo-task-2-0  N/A

4. Describe the pod:
$ kubectl describe pod demo-task-0-0
Name:               demo-task-0-0
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app=volcanojob
                    chart=volcanojob-0.0.1
                    createdBy=VolcanoJob
                    heritage=Tiller
                    release=demo
                    volcano-role=driver
                    volcano.sh/job-name=demo
                    volcano.sh/job-namespace=default
Annotations:        scheduling.k8s.io/group-name: demo
                    volcano.sh/job-name: demo
                    volcano.sh/job-version: 0
                    volcano.sh/task-spec: task-0
Status:             Pending
IP:
Controlled By:      Job/demo
Containers:
  task:
    Image:      ubuntu
    Port:       2222/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     250m
      memory:  128Mi
    Requests:
      cpu:        250m
      memory:     128Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4nz8b (ro)
Volumes:
  default-token-4nz8b:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4nz8b
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
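
For reference, Node is <none> and the events list is empty, so it seems no scheduler ever tried to place the pod. One way to dig further (a sketch, using the scheduler deployment name from step 1) is to check the scheduler logs:

$ kubectl logs deployment/volcano-release-scheduler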

/cc @SrinivasChilveri @k82cn need your help, thx~

@k82cn
Contributor

k82cn commented Aug 13, 2019

@hzxuzhonghu, please help on this one :)

@hzxuzhonghu
Contributor

@xiechengsheng It is because we merged kube-batch into volcano (volcano-sh/volcano#288) and renamed the schedulerName from kube-batch to volcano.

So let me fix the docs here too.
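
In the meantime: pods rendered from the old templates still carry schedulerName: kube-batch, which no scheduler serves after the rename, so they stay Pending forever. A minimal sketch of a job spec with the new scheduler name (other fields are illustrative, not the arena chart's actual template):

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: demo
spec:
  schedulerName: volcano   # was kube-batch before volcano-sh/volcano#288
  minAvailable: 1
  tasks:
    - name: task-0
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: ubuntu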

@xiechengsheng
Contributor Author

Hi @hzxuzhonghu, sorry to disturb you again, but I really don't understand the current problem.

  • I met another problem, which might be a tiny one: how can I delete the installed volcano-release operator? I want to install another volcano-release in a different namespace, so I delete the current operator with helm del --purge volcano-release, but when I try to recreate a new operator with helm install --name volcano-release --namespace arena-system kubernetes-artifacts/volcano-operator, k8s tells me: Error: customresourcedefinitions.apiextensions.k8s.io "queues.scheduling.incubator.k8s.io" already exists.

  • Then I delete that resource with kubectl delete customresourcedefinitions.apiextensions.k8s.io queues.scheduling.incubator.k8s.io and continue creating the volcano release operator, but many other leftovers still exist, such as podgroups.scheduling.incubator.k8s.io, commands.bus.volcano.sh, etc. After I delete all of them and recreate the volcano release operator with helm install --name volcano-release --namespace arena-system kubernetes-artifacts/volcano-operator, another error happens:

# kubectl get pod | grep  volcano-release
volcano-release123-admission-576d6979db-dm24w     0/1     CrashLoopBackOff   8          18m
volcano-release123-admission-init-hszjn           0/1     Completed          0          18m
volcano-release123-controllers-7bcc5b6d75-w8vd6   1/1     Running            0          18m
volcano-release123-scheduler-6d5c65d8d8-f8nzc     0/1     CrashLoopBackOff   8          18m
  • Then I dig into the pod logs:
# kubectl logs volcano-release123-scheduler-6d5c65d8d8-f8nzc
panic: failed init default queue, with err: queues.scheduling.sigs.dev is forbidden: User "system:serviceaccount:default:volcano-release123-scheduler" cannot create resource "queues" in API group "scheduling.sigs.dev" at the cluster scope

goroutine 1 [running]:
volcano.sh/volcano/pkg/scheduler/cache.newSchedulerCache(0xc0004a05a0, 0x1485f11, 0x7, 0x14853f7, 0x7, 0x0)
	/home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/cache/cache.go:272 +0x207e
volcano.sh/volcano/pkg/scheduler/cache.New(0xc0004a05a0, 0x1485f11, 0x7, 0x14853f7, 0x7, 0x10, 0xc0003ba750)
	/home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/cache/cache.go:69 +0x53
volcano.sh/volcano/pkg/scheduler.NewScheduler(0xc0004a05a0, 0x1485f11, 0x7, 0x7ffd10c9ff35, 0x29, 0x3b9aca00, 0x14853f7, 0x7, 0x12693c0, 0x1, ...)
	/home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:55 +0x5d
volcano.sh/volcano/cmd/scheduler/app.Run(0xc000390320, 0x1515810, 0x1515810)
	/home/travis/gopath/src/volcano.sh/volcano/cmd/scheduler/app/server.go:85 +0xde
main.main()
	/home/travis/gopath/src/volcano.sh/volcano/cmd/scheduler/main.go:62 +0x17d

Could you help me to solve the problem? And could you teach me the right way to delete a volcano release operator? Thanks in advance!
/cc @k82cn

@hzxuzhonghu
Contributor

Sorry for the late reply.

As you described, it seems the deletion was not done cleanly.

panic: failed init default queue, with err: queues.scheduling.sigs.dev is forbidden: User "system:serviceaccount:default:volcano-release123-scheduler" cannot create resource "queues" in API group "scheduling.sigs.dev" at the cluster scope

This is an authorization (RBAC) failure.

Try this:

kubectl get clusterrolebinding volcano-scheduler-role -oyaml

kubectl get clusterrole volcano-scheduler -oyaml

And also check the serviceAccount of the volcano-scheduler pod.
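
If those objects are missing, the scheduler's service account lacks exactly the permission the panic complains about. A rough sketch of what the missing RBAC would have to grant, reconstructed from the error message alone (the real chart manifest grants more than this):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: volcano-scheduler
rules:
  - apiGroups: ["scheduling.sigs.dev"]
    resources: ["queues"]
    verbs: ["get", "list", "watch", "create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: volcano-scheduler-role
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: volcano-scheduler
subjects:
  - kind: ServiceAccount
    name: volcano-release123-scheduler   # service account name taken from the panic message
    namespace: default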

@xiechengsheng
Contributor Author

@hzxuzhonghu Hi, thanks for your reply. I have tried kubectl get serviceaccounts | grep volcano but got nothing; there is no clusterrolebinding/clusterrole/serviceaccount for volcano in the cluster.

The steps to reproduce my problem:

1. Delete the volcano release:
# helm del --purge volcano-release
release "volcano-release" deleted

2. Recreate it, but get an error:
# helm install --name volcano-release --namespace arena-system kubernetes-artifacts/volcano-operator
Error: customresourcedefinitions.apiextensions.k8s.io "commands.bus.volcano.sh" already exists
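
So it looks like helm del --purge leaves the chart's CRDs behind, which matches the "already exists" errors above. One way to clean them up by hand before reinstalling (a sketch; CRD names taken from the errors in this thread, your cluster may have more):

# kubectl get crd | grep -E 'volcano.sh|scheduling.incubator.k8s.io|scheduling.sigs.dev'
# kubectl delete crd commands.bus.volcano.sh queues.scheduling.incubator.k8s.io podgroups.scheduling.incubator.k8s.io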
