Runners become offline #62
Comments
Have you updated your image to the latest runner version? I don't know if this is the reason why it's offline, but I noticed that my controller would get this sometimes, and it goes away if you rebuild your runner image / update to the latest version.
Hi, yes, my Docker image is based on it. How often do you get that issue, @igorbrigadir?
Very rarely. A runner was not used at all for a while, and maybe it was related to actions/runner#289. Updating my custom runner image and removing and re-adding the runners solved it for me.
@rezmuh Thank you for the report! I don't think this is a token expiration issue. In order to get a more complete picture of the situation, could you please provide us with the results of the following commands?
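Diagnostics for this kind of problem usually mean checking the runner custom resources, the runner pods, and the controller logs. Below is a minimal sketch of such commands, assuming a default installation; the namespace, deployment, and container names are assumptions and vary by install method.

```sh
# Hypothetical diagnostic commands; resource names and namespaces are assumptions.
# List the Runner custom resources and their reported status.
kubectl get runners -o wide

# List the runner pods and their container readiness (e.g. 1/2 Ready).
kubectl get pods -o wide

# Tail the controller's logs for reconciliation errors
# (a default install places it in actions-runner-system as controller-manager).
kubectl -n actions-runner-system logs deployment/controller-manager -c manager --tail=200
```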
OK, I'll report back as soon as all the runners are offline again.
This is the output of the first command, and here's the output of the second. So it appears that a runner goes offline after completing one of the jobs, but then a new runner is not created.
What I think had just happened, though, was that I had 5 runners and all 5 were running, and there were even a few pipelines in the queue. However, as each running runner finished its task, it went offline, until all 5 were offline and no new ones were created.
Thank you for the information! I think something is happening on the controller side.
Hi @summerwind, I finally got some runners going offline again. I now have 5 offline runners and 5 available runners. Here's the log you requested: https://gist.github.com/rezmuh/fa1c90821c509d5380ad0fc22ab35e53 From a brief look at the logs, though, they still seem to show 10 available runners.
I'm also experiencing this. I noticed it when I cancelled the workflow from the GitHub UI; when you do that, the runner seems to go offline and never restarts properly. So could it have something to do with forcefully cancelling a workflow via the GitHub UI?
I don't have enough bandwidth for this right now, but in the next few days I will see if I can reproduce it by cancelling a workflow while it is running.
This has happened to me even when I don't cancel a workflow. The runner container shows up as completed, while the Docker one is still up; the pod is effectively 1/2 healthy. There is nothing valuable in the logs, and the controller seems to think the runners are all healthy. This issue has been haunting me for a while and I haven't been able to put my finger on what the cause is exactly. Let me know if there's something you'd like me to do the next time it happens, @summerwind. I can collect a bunch of logs and metrics for you. FYI, I'm using the org runner with a GitHub App (not a personal token).
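As an aside, one quick way to confirm the half-healthy state described above (runner container exited, Docker-in-Docker container still running) is to print each container's state for the affected pod. A small sketch, using a placeholder pod name:

```sh
# Print each container's name and state for a stuck runner pod.
# "example-runner-abcde" is a placeholder; `kubectl get pods` will also show 1/2 Ready.
kubectl get pod example-runner-abcde \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'
```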
Alright, it's happening again for me. 4 out of 5 runners are down. As I mentioned earlier, the Docker container is up and the runner is down. It probably happened after an Actions job. Removing the pods does not fix the issue; the controller seems to be in a broken state. In the controller's logs I see a lot of one recurring error, and in the middle of that sea I also saw another message that seems more useful.
Edit: more logs.
Yeah, I also found that deleting the pods does not fix this issue; you have to tear the whole thing down and rebuild it for it to work, which is not an ideal fix.
Any update on this? @summerwind or @mumoshu, any chance either of you could take a look? This is still happening to me multiple times a day, and my only fix right now is to restart the controller. Edit: this is actually rarer than I thought; I had an issue with the autoscaler that made it seem like this.
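For anyone trying the same workaround, restarting the controller does not require re-installing anything. A sketch, assuming the default actions-runner-system namespace and controller-manager deployment name:

```sh
# Roll the controller pods and wait for the new ones to become ready.
# Namespace and deployment name are assumptions based on a default install.
kubectl -n actions-runner-system rollout restart deployment controller-manager
kubectl -n actions-runner-system rollout status deployment controller-manager
```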
Yeah, I'm also waiting for an update on this. I believe my issue is related to #69.
I run a custom image without issues, @kaykhancheckpoint. This only happens to me every few days, and restarting the controller fixes it.
@kuuji Yeah, I can run a custom image too, but it fails every so often; runners become offline and they never start again.
@kaykhancheckpoint Just curious, but how much CPU is your pod permitted to use? I'm asking because I've heard of cases where a slow runner can fail to auto-update the runner agent binary, which makes the runner fail. To be honest, I'm not sure how it stays offline (it should definitely be recreated by the controller once it has failed), but that may be another issue. Anyway, if your issue is coming from the instability of auto-updates, it's going to be fixed via #99.
Yeah, I don't think it was to do with CPU; for me, I believe it's to do with auto-updates.
I think there are (at least) two different issues
I can confirm that I am also seeing this issue. I am left with a pod running only the DinD container, with the runner container having exited with code 0. GitHub shows the runner as offline. The only fix is to re-create the deployment. I am also using a custom runner image based on (as of yesterday) summerwind/actions-runner:latest.
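Spelling out that workaround: if the runners are managed by a RunnerDeployment resource, re-creating it tears down the stuck pods and registers fresh runners. A sketch, where runner-deployment.yaml is a placeholder for your own manifest:

```sh
# Delete and re-apply the runner deployment to replace the stuck runners.
# The manifest path is a placeholder for your own RunnerDeployment definition.
kubectl delete -f runner-deployment.yaml
kubectl apply -f runner-deployment.yaml
```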
I have been running pretty smoothly for the past week or so. I've bumped the requests and limits of the controller and of the runners: I pretty much doubled them for the controller (based on what the defaults were), and my runners have pretty high limits due to my CI needs. FYI @rezmuh, not having requests/limits set doesn't mean you won't hit CPU issues; it's actually more likely that you will hit issues without them set.
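As a concrete sketch of bumping the controller's requests and limits in place: the namespace, deployment, and container names below are assumptions based on a default install, and the values are examples rather than recommendations (runner resources are configured separately in the runner spec).

```sh
# Raise the controller's CPU/memory requests and limits.
# Names and values are placeholders; adjust to your install and workload.
kubectl -n actions-runner-system set resources deployment controller-manager \
  --containers=manager \
  --requests=cpu=200m,memory=256Mi \
  --limits=cpu=1,memory=512Mi
```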
I see runners offline frequently. With my setup, the logs show one recurring set of messages, and then eventually another. I note the pod was created with the exact same name, not, as expected, with a different suffix.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, I was finally able to get the actions runner working last night. My setup involves:
However, all of a sudden it has stopped working. All the runners appear to be offline in the organization's Actions view (as shown below).
As these runners seem to have stopped working roughly 12 hours after they first started, is this more of a token expiration issue?
What should I do to avoid this issue?
Thanks