
Runners become offline #62

Closed · rezmuh opened this issue Jun 25, 2020 · 25 comments

@rezmuh

rezmuh commented Jun 25, 2020

Hi, I finally got the actions runner working last night. My setup involves the following (a rough manifest sketch follows the list):

  • Kubernetes v1.16
  • Built with eksctl 0.22.0
  • RunnerDeployment (with 3 replicas)
  • Organization-wide
  • Custom runner image
  • nodeSelector
  • Github Personal Access Token
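For context, a minimal RunnerDeployment along those lines might look like the sketch below. This is only an illustration assembled from the bullets above; the names, image, and node selector values are placeholders, not my actual manifest.

# Illustrative sketch only: names, image, and labels are placeholders.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-org-runners
spec:
  replicas: 3
  template:
    spec:
      # Organization-wide runners (registered against the org, not a single repo)
      organization: example-org
      # Custom runner image built on top of summerwind/actions-runner
      image: registry.example.com/example-org/custom-runner:v2.263.0
      # Pin runner pods to a dedicated node group
      nodeSelector:
        role: ci-runners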

However, it suddenly stopped working. All the runners appear to be offline in the Organization's Actions view (as shown below).

[Screenshot: Organization's Actions view, 2020-06-25 11:45, showing the runners offline]

Since these runners seemed to stop working roughly 12 hours after they first started, could this be a token expiration issue?

What should I do to avoid this issue?

Thanks

@rezmuh changed the title from "Runner becomes offline" to "Runners become offline" on Jun 25, 2020
@igorbrigadir

Custom runner image

Have you updated your image to the latest runner version? v2.263.0 https://github.com/actions/runner/releases

I don't know if this is the reason it's offline, but I noticed that my controller would sometimes log the error below, and it goes away if you rebuild your runner image / update to the latest version:

ERROR... the object has been modified; please apply your changes to the latest version and try again"}

@rezmuh
Author

rezmuh commented Jun 25, 2020

Hi, yes, my Docker image is based on summerwind/actions-runner:v2.263.0.

How often do you get that issue, @igorbrigadir?

@igorbrigadir

Very rarely - it happened when a runner had not been used at all for a while, and maybe it was related to actions/runner#289.

Updating my custom runner image, removing and adding the runners back solved it for me.

@summerwind
Contributor

@rezmuh Thank you for the report!

I don't think this is a token expiration issue. To get a more complete picture of the situation, could you please provide us with the output of the following commands?

$ kubectl describe pods "(The name of the runner pod)"
$ kubectl logs "(The name of the runner pod)" -c runner

@rezmuh
Author

rezmuh commented Jun 25, 2020

OK, I'll report back as soon as all the runners are offline again.

@rezmuh
Author

rezmuh commented Jun 26, 2020

This is the output of kubectl describe pod: https://gist.github.com/rezmuh/9bcc2502ca5f1a51ee0ee7c412d5d1fd

And here's the output of kubectl logs <pod> -c runner: https://gist.github.com/rezmuh/f09ba992bc86dec562ee6c2e90cda408

So it appears that the runner goes offline after completing one of the jobs, but a new runner is not created.

@rezmuh
Author

rezmuh commented Jun 26, 2020

What I think just happened, though, was that I had 5 runners and all 5 were busy; there were even a few pipelines in the queue. However, as each running runner finished its task it went offline, until all 5 were offline and no new ones were created.

@summerwind
Contributor

Thank you for the information! I think something is happening on the controller side.
Can you provide the controller's log from before and after the point where the runner failed to start, using the following command?

$ kubectl logs -n actions-runner-system ${CONTROLLER_MANAGER_POD} -c manager

@rezmuh
Author

rezmuh commented Jul 2, 2020

Hi @summerwind I finally got some runners being offline again. I now have 5 offline runners and 5 available runners. Here's the log you requested: https://gist.github.com/rezmuh/fa1c90821c509d5380ad0fc22ab35e53

From a brief look at the logs, though, the controller still seems to think there are 10 available runners.

[Screenshot: Organization's Actions view, 2020-07-02 16:40, showing 5 offline and 5 available runners]

@kaykhancheckpoint

kaykhancheckpoint commented Jul 9, 2020

I'm also experiencing this. I noticed it when I cancelled a workflow from the GitHub UI; when you do that, the runner seems to go offline and never restarts properly.

So could it have something to do with forcefully cancelling a workflow via the GitHub UI?

@summerwind
Contributor

I don't have enough bandwidth for this right now, but in the next few days I will see if I can reproduce it by cancelling a workflow while it is running.

@kuuji

kuuji commented Aug 18, 2020

I'm also experiencing this. I noticed it when I cancelled a workflow from the GitHub UI; when you do that, the runner seems to go offline and never restarts properly.

So could it have something to do with forcefully cancelling a workflow via the GitHub UI?

This has happened to me even when I don't cancel a workflow. The runner container shows up as completed while the docker one is still up, so the pod is effectively 1/2 healthy.

There is nothing valuable in the logs; the controller seems to think the runners are all healthy. This issue has been haunting me for a while and I haven't been able to put my finger on what the cause is exactly.

Let me know if there's something you'd like me to do for the next time it happens, @summerwind. I can collect a bunch of logs and metrics for you.

FYI, I'm using the org runner with a GitHub App (not a personal access token).
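In case it's useful, here is a rough idea of what I could capture next time it happens. These are generic kubectl commands; the namespace, pod, and container/deployment names are placeholders and may differ from the actual install.

# Rough diagnostics sketch; names are placeholders, adjust to your setup.
kubectl -n <runner-namespace> get pods                      # the affected pod typically shows 1/2 Ready
kubectl -n <runner-namespace> describe pod <runner-pod>
kubectl -n <runner-namespace> logs <runner-pod> -c runner
kubectl -n <runner-namespace> logs <runner-pod> -c docker
kubectl -n actions-runner-system logs <controller-manager-pod> -c manager --since=1h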

@kuuji

kuuji commented Aug 19, 2020

Alright, it's happening again for me: 4/5 runners are down. As I mentioned earlier, the docker container is up and the runner container is down. It probably happened after an Actions job.

Removing the pods does not fix the issue; the controller seems to be in a broken state.

In the controller's logs I see a lot of this:

 2020-08-19T15:44:14.122Z    DEBUG    controller-runtime.controller    Successfully Reconciled    {"controller": "runner", "request": "ci/ci-runners-p45s6-lqlqw"}

And in the middle of that sea of messages I also saw this, which seems more useful:

 2020-08-19T15:44:14.033Z    ERROR    controller-runtime.controller    Reconciler error    {"controller": "runnerreplicaset", "request": "ci/ci-runners-p45s6", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"ci-runners-p45s6\": the object has been modified; please apply your changes to the latest version and try again"}
 github.com/go-logr/zapr.(*zapLogger).Error
     /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
 k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
 k8s.io/apimachinery/pkg/util/wait.JitterUntil
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
 k8s.io/apimachinery/pkg/util/wait.Until
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88

Edit: more logs

 2020-08-18T20:25:59.788Z    ERROR    controller-runtime.controller    Reconciler error    {"controller": "runner", "request": "ci/ci-runners-
 github.com/go-logr/zapr.(*zapLogger).Error
     /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
 k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
 k8s.io/apimachinery/pkg/util/wait.JitterUntil
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
 k8s.io/apimachinery/pkg/util/wait.Until
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88

@kaykhancheckpoint

Yeah, I also found that deleting the pods does not fix this issue; you have to tear the setup down and rebuild it for it to work, which is not an ideal fix.

@kuuji

kuuji commented Sep 21, 2020

Any update on this? @summerwind or @mumoshu, any chance one of you could take a look at this?

This is still happening to me multiple times a day, and my only fix right now is to restart the controller.

Edit: This is actually rarer than I thought. I had an issue with the autoscaler that made it look like this.
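For anyone else landing here, restarting the controller can be done with something like the command below. The namespace and deployment name assume a default install and may need adjusting; this is only a sketch of the workaround, not a fix.

# Assumes default install names; adjust namespace/deployment to your setup.
kubectl -n actions-runner-system rollout restart deployment controller-manager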

@kaykhancheckpoint

Yeah, I'm also waiting for an update on this. I believe my issue is related to #69.

@kuuji

kuuji commented Sep 23, 2020

I run a custom image without issues, @kaykhancheckpoint. This only happens to me every few days, and restarting the controller fixes it.

@kaykhancheckpoint

@kuuji yeah, I can run a custom image, but it fails every so often: runners become offline and they never restart.

@mumoshu
Collaborator

mumoshu commented Sep 23, 2020

@kaykhancheckpoint Just curious, but how much CPU is your pod permitted to use? I'm asking because I've heard of cases where a slow runner fails to auto-update the runner agent binary, which makes the runner fail. To be honest, I'm not sure how it stays offline (it should definitely be recreated by the controller once it fails), but that may be another issue.

Anyway, if your issue is caused by the instability of auto-updates, it's going to be fixed via #99.

@kaykhancheckpoint

Yeah, I don't think it was to do with CPU; for me, I believe it's to do with auto-updates.

@rezmuh
Author

rezmuh commented Sep 28, 2020

I think there are (at least) two different issues:

  • Runners become offline because of auto-updates. For this one, since I use a custom runner, I will have to wait until summerwind's image on Docker Hub is updated so that I can rebuild my custom runner. Only then can I delete the existing runner sets and create new ones with the custom runner.
  • Runners become offline intermittently. I haven't been able to find any correlation with CPU time (I don't set any CPU limits on the pods), with the success of the last jobs, or with anything else; I have no clue on this one. It hasn't happened as often in the past month or so, since the auto-update issue occurs more frequently than this one.

@bagel-dawg

I can confirm that I am also seeing this issue. I am left with a pod running only the DinD container, the runner container having exited with code 0, and GitHub shows the runner as offline. The only fix is to re-create the deployment. I am also using a custom runner based on (as of yesterday) summerwind/actions-runner:latest.

@kuuji

kuuji commented Oct 2, 2020

Things have been running pretty smoothly for me for the past week or so. I've bumped the requests and limits of the controller and of the runners: I pretty much doubled them for the controller (based on what the default was), and my runners have pretty high limits due to my CI needs.

FYI @rezmuh, not having requests/limits doesn't mean you won't hit CPU issues; it's actually more likely that you will hit issues without them set. Because Kubernetes schedules pods based on their resource requests, if you don't set them, Kubernetes might schedule all your pods on the same node since it won't know how much they consume.
I'd advise setting these fairly high, e.g. along the lines of the sketch below.
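As a rough illustration (values and names are placeholders, and the exact field placement reflects my understanding of the RunnerDeployment CRD, so double-check against your version), resources can be set on the runner template like this:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runners
spec:
  replicas: 5
  template:
    spec:
      organization: example-org
      # Requests drive scheduling decisions; limits cap actual usage.
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi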

@Nuru
Contributor

Nuru commented Oct 27, 2020

I see runners offline frequently. Using summerwind/actions-runner-dind:v2.273.5, when a job finishes it can take up to 10 minutes to respawn the runner. During the outage, the controller will log normal-looking messages such as

controllers.RunnerReplicaSet	debug	{"runner": "actions-runner-system/action-runner-repo-hk2x7", "desired": 1, "available": 1}
controller-runtime.controller	Successfully Reconciled	{"controller": "runnerreplicaset", "request": "actions-runner-system/action-runner-repo-hk2x7"}

and then eventually

INFO	controllers.Runner	Deleted runner pod	{"runner": "actions-runner-system/action-runner-repo-hk2x7-qc24l", "repository": "Nuru/repo"}
DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runner", "request": "actions-runner-system/action-runner-repo-hk2x7-qc24l"}
DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runner", "request": "actions-runner-system/action-runner-repo-hk2x7-qc24l"}
DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Runner","namespace":"actions-runner-system","name":"action-runner-repo-hk2x7-qc24l","uid":"6597198e-54ac-4499-8510-066a167ca116","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"21755700"}, "reason": "PodDeleted", "message": "Deleted pod 'action-runner-repo-hk2x7-qc24l'"}
INFO	controllers.Runner	Created runner pod	{"runner": "actions-runner-system/action-runner-repo-hk2x7-qc24l", "repository": "Nuru/repo"}

I note the pod was re-created with the exact same name, not, as expected, with a different suffix.

@stale

stale bot commented Apr 30, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
