
Runners become offline #62

Closed · rezmuh opened this issue Jun 25, 2020 · 25 comments

@rezmuh

rezmuh commented Jun 25, 2020

Hi, I finally got the actions runner working last night. My setup involves the following (a rough manifest sketch follows the list):

  • Kubernetes v1.16
  • Built with eksctl 0.22.0
  • RunnerDeployment (with 3 replicas)
  • Organization-wide
  • Custom runner image
  • nodeSelector
  • Github Personal Access Token
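For context, a minimal RunnerDeployment along those lines might look like the sketch below. This is only an illustration assembled from the bullets above; the names, image, and node selector values are placeholders, not my actual manifest.

# Illustrative sketch only: names, image, and labels are placeholders.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-org-runners
spec:
  replicas: 3
  template:
    spec:
      # Organization-wide runners (registered against the org, not a single repo)
      organization: example-org
      # Custom runner image built on top of summerwind/actions-runner
      image: registry.example.com/example-org/custom-runner:v2.263.0
      # Pin runner pods to a dedicated node group
      nodeSelector:
        role: ci-runners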

However, it suddenly stopped working. All the runners appear to be offline in the Organization's Actions view (as shown below).

[Screenshot: Organization's Actions view, 2020-06-25 11:45, showing the runners offline]

Since these runners seemed to stop working roughly 12 hours after they first started, could this be a token expiration issue?

What should I do to avoid this issue?

Thanks

@rezmuh changed the title from "Runner becomes offline" to "Runners become offline" on Jun 25, 2020
@igorbrigadir

Custom runner image

Have you updated your image to the latest runner version? v2.263.0 https://github.com/actions/runner/releases

I don't know if this is the reason it's offline, but I noticed that my controller would sometimes log the error below, and it goes away if you rebuild your runner image / update to the latest version:

ERROR... the object has been modified; please apply your changes to the latest version and try again"}

@rezmuh
Author

rezmuh commented Jun 25, 2020

Hi, yes, my Docker image is based on summerwind/actions-runner:v2.263.0.

How often do you get that issue, @igorbrigadir?

@igorbrigadir

Very rarely - it happened when a runner had not been used at all for a while, and maybe it was related to actions/runner#289.

Updating my custom runner image, removing and adding the runners back solved it for me.

@summerwind
Contributor

@rezmuh Thank you for the report!

I don't think this is a token expiration issue. To get a more complete picture of the situation, could you please provide us with the output of the following commands?

$ kubectl describe pods "(The name of the runner pod)"
$ kubectl logs "(The name of the runner pod)" -c runner

@rezmuh
Author

rezmuh commented Jun 25, 2020

OK, I'll report back as soon as all the runners are offline again.

@rezmuh
Author

rezmuh commented Jun 26, 2020

This is the output of kubectl describe pod: https://gist.github.com/rezmuh/9bcc2502ca5f1a51ee0ee7c412d5d1fd

And here's the output of kubectl logs <pod> -c runner: https://gist.github.com/rezmuh/f09ba992bc86dec562ee6c2e90cda408

So it appears that the runner goes offline after completing one of the jobs, but a new runner is not created.

@rezmuh
Author

rezmuh commented Jun 26, 2020

What I think just happened, though, was that I had 5 runners and all 5 were busy; there were even a few pipelines in the queue. However, as each running runner finished its task it went offline, until all 5 were offline and no new ones were created.

@summerwind
Contributor

Thank you for the information! I think something is happening on the controller side.
Can you provide the controller's log from before and after the point where the runner failed to start, using the following command?

$ kubectl logs -n actions-runner-system ${CONTROLLER_MANAGER_POD} -c manager

@rezmuh
Author

rezmuh commented Jul 2, 2020

Hi @summerwind I finally got some runners being offline again. I now have 5 offline runners and 5 available runners. Here's the log you requested: https://gist.github.com/rezmuh/fa1c90821c509d5380ad0fc22ab35e53

From a brief look at the logs, though, the controller still seems to think there are 10 available runners.

[Screenshot: Organization's Actions view, 2020-07-02 16:40, showing 5 offline and 5 available runners]

@kaykhancheckpoint

kaykhancheckpoint commented Jul 9, 2020

I'm also experiencing this. I noticed it when I cancelled a workflow from the GitHub UI; when you do that, the runner seems to go offline and never restarts properly.

So could it have something to do with forcefully cancelling a workflow via the GitHub UI?

@summerwind
Contributor

I don't have enough bandwidth for this right now, but in the next few days I will see if I can reproduce it by cancelling a workflow while it is running.

@kuuji

kuuji commented Aug 18, 2020

I'm also experiencing this. I noticed it when I cancelled a workflow from the GitHub UI; when you do that, the runner seems to go offline and never restarts properly.

So could it have something to do with forcefully cancelling a workflow via the GitHub UI?

This has happened to me even when I don't cancel a workflow. The runner container shows up as completed while the docker one is still up, so the pod is effectively 1/2 healthy.

There is nothing valuable in the logs; the controller seems to think the runners are all healthy. This issue has been haunting me for a while and I haven't been able to put my finger on what the cause is exactly.

Let me know if there's something you'd like me to do for the next time it happens, @summerwind. I can collect a bunch of logs and metrics for you.

FYI, I'm using the org runner with a GitHub App (not a personal access token).
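In case it's useful, here is a rough idea of what I could capture next time it happens. These are generic kubectl commands; the namespace, pod, and container/deployment names are placeholders and may differ from the actual install.

# Rough diagnostics sketch; names are placeholders, adjust to your setup.
kubectl -n <runner-namespace> get pods                      # the affected pod typically shows 1/2 Ready
kubectl -n <runner-namespace> describe pod <runner-pod>
kubectl -n <runner-namespace> logs <runner-pod> -c runner
kubectl -n <runner-namespace> logs <runner-pod> -c docker
kubectl -n actions-runner-system logs <controller-manager-pod> -c manager --since=1h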

@kuuji

kuuji commented Aug 19, 2020

Alright, it's happening again for me: 4/5 runners are down. As I mentioned earlier, the docker container is up and the runner container is down. It probably happened after an Actions job.

Removing the pods does not fix the issue; the controller seems to be in a broken state.

In the controller's logs I see a lot of this:

 2020-08-19T15:44:14.122Z    DEBUG    controller-runtime.controller    Successfully Reconciled    {"controller": "runner", "request": "ci/ci-runners-p45s6-lqlqw"}

And in the middle of that sea of messages I also saw this, which seems more useful:

 2020-08-19T15:44:14.033Z    ERROR    controller-runtime.controller    Reconciler error    {"controller": "runnerreplicaset", "request": "ci/ci-runners-p45s6", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"ci-runners-p45s6\": the object has been modified; please apply your changes to the latest version and try again"}
 github.com/go-logr/zapr.(*zapLogger).Error
     /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
 k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
 k8s.io/apimachinery/pkg/util/wait.JitterUntil
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
 k8s.io/apimachinery/pkg/util/wait.Until
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88

Edit: more logs

 2020-08-18T20:25:59.788Z    ERROR    controller-runtime.controller    Reconciler error    {"controller": "runner", "request": "ci/ci-runners-
 github.com/go-logr/zapr.(*zapLogger).Error
     /go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
     /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
 k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
 k8s.io/apimachinery/pkg/util/wait.JitterUntil
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
 k8s.io/apimachinery/pkg/util/wait.Until
     /go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88

@kaykhancheckpoint

Yeah, I also found that deleting the pods does not fix this issue; you have to tear the setup down and rebuild it for it to work, which is not an ideal fix.

@kuuji

kuuji commented Sep 21, 2020

Any update on this? @summerwind or @mumoshu, any chance one of you could take a look at this?

This is still happening to me multiple times a day, and my only fix right now is to restart the controller.

Edit: This is actually rarer than I thought. I had an issue with the autoscaler that made it look like this.
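For anyone else landing here, restarting the controller can be done with something like the command below. The namespace and deployment name assume a default install and may need adjusting; this is only a sketch of the workaround, not a fix.

# Assumes default install names; adjust namespace/deployment to your setup.
kubectl -n actions-runner-system rollout restart deployment controller-manager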

@kaykhancheckpoint

Yeah, I'm also waiting for an update on this. I believe my issue is related to #69.

@kuuji

kuuji commented Sep 23, 2020

I run a custom image without issues, @kaykhancheckpoint. This only happens to me every few days, and restarting the controller fixes it.

@kaykhancheckpoint

@kuuji yeah, I can run a custom image, but it fails every so often: runners become offline and they never restart.

@mumoshu
Collaborator

mumoshu commented Sep 23, 2020

@kaykhancheckpoint Just curious, but how much CPU is your pod permitted to use? I'm asking because I've heard of cases where a slow runner fails to auto-update the runner agent binary, which makes the runner fail. To be honest, I'm not sure how it stays offline (it should definitely be recreated by the controller once it fails), but that may be another issue.

Anyway, if your issue is caused by the instability of auto-updates, it's going to be fixed via #99.

@kaykhancheckpoint

Yeah, I don't think it was to do with CPU; for me, I believe it's to do with auto-updates.

@rezmuh
Author

rezmuh commented Sep 28, 2020

I think there are (at least) two different issues:

  • Runners become offline because of auto-updates. For this one, since I use a custom runner, I will have to wait until summerwind's image on Docker Hub is updated so that I can rebuild my custom runner. Only then can I delete the existing runner sets and create new ones with the custom runner.
  • Runners become offline intermittently. I haven't been able to find any correlation with CPU time (I don't set any CPU limits on the pods), with the success of the last jobs, or with anything else; I have no clue on this one. It hasn't happened as often in the past month or so, since the auto-update issue occurs more frequently than this one.

@bagel-dawg

I can confirm that I am also seeing this issue. I am left with a pod running only the DinD container, the runner container having exited with code 0, and GitHub shows the runner as offline. The only fix is to re-create the deployment. I am also using a custom runner based on (as of yesterday) summerwind/actions-runner:latest.

@kuuji

kuuji commented Oct 2, 2020

Things have been running pretty smoothly for me for the past week or so. I've bumped the requests and limits of the controller and of the runners: I pretty much doubled them for the controller (based on what the default was), and my runners have pretty high limits due to my CI needs.

FYI @rezmuh, not having requests/limits doesn't mean you won't hit CPU issues; it's actually more likely that you will hit issues without them set. Because Kubernetes schedules pods based on their resource requests, if you don't set them, Kubernetes might schedule all your pods on the same node since it won't know how much they consume.
I'd advise setting these fairly high, e.g. along the lines of the sketch below.
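As a rough illustration (values and names are placeholders, and the exact field placement reflects my understanding of the RunnerDeployment CRD, so double-check against your version), resources can be set on the runner template like this:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runners
spec:
  replicas: 5
  template:
    spec:
      organization: example-org
      # Requests drive scheduling decisions; limits cap actual usage.
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi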

@Nuru
Contributor

Nuru commented Oct 27, 2020

I see runners offline frequently. Using summerwind/actions-runner-dind:v2.273.5, when a job finishes it can take up to 10 minutes to respawn the runner. During the outage, the controller will log normal-looking messages such as

controllers.RunnerReplicaSet	debug	{"runner": "actions-runner-system/action-runner-repo-hk2x7", "desired": 1, "available": 1}
controller-runtime.controller	Successfully Reconciled	{"controller": "runnerreplicaset", "request": "actions-runner-system/action-runner-repo-hk2x7"}

and then eventually

INFO	controllers.Runner	Deleted runner pod	{"runner": "actions-runner-system/action-runner-repo-hk2x7-qc24l", "repository": "Nuru/repo"}
DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runner", "request": "actions-runner-system/action-runner-repo-hk2x7-qc24l"}
DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runner", "request": "actions-runner-system/action-runner-repo-hk2x7-qc24l"}
DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Runner","namespace":"actions-runner-system","name":"action-runner-repo-hk2x7-qc24l","uid":"6597198e-54ac-4499-8510-066a167ca116","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"21755700"}, "reason": "PodDeleted", "message": "Deleted pod 'action-runner-repo-hk2x7-qc24l'"}
INFO	controllers.Runner	Created runner pod	{"runner": "actions-runner-system/action-runner-repo-hk2x7-qc24l", "repository": "Nuru/repo"}

I note the pod was re-created with the exact same name, not, as expected, with a different suffix.

@stale

stale bot commented Apr 30, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
