Autoupdate from 2.163.1 to 2.164.0 stucks #289

Alex-ala · 2020-01-17T08:01:18Z

Describe the bug
We have had self-hosted runners deployed and running for some time. Today they started self-updating from version 2.163.1 to 2.164.0. The runner downloaded the new version and at least partially copied the new data. After the automatic restart, the runner no longer processes any jobs. Neither cancelling and restarting the job via the web UI, nor restarting the runner, nor pushing a new commit gets the runner unstuck.
For the runner to work again I had to:

1. Remove all content in _work
2. Let it run the update once again, so that it gets stuck again
3. Kill the runner
4. Run _work/_update.sh as the runner's user
5. Kill the runner that was started by _update.sh (kill both the script itself and the runner)
6. Restart the runner normally

I observed that after the failed update the folder structure contains some new folders suffixed with the new version. After a correct manual update, there are symlinks to these directories, which were not created during the automatic update. Maybe this is the cause of the error.

To Reproduce
Steps to reproduce the behavior:

1. Have a 2.163.1 runner running
2. Start a job on that runner
3. Watch it update and get stuck

Expected behavior
The runner updates correctly

Runner Version and Platform

Runner (before): 2.163.1
Runner (after): 2.164.0
OS: CentOS 7

What's not working?

Jobs are forever queued: "Starting your workflow run..."

Job Log Output

"Starting your workflow run..."

Runner and Worker's Diagnostic Logs

error.log
ls.txt

Alex-ala · 2020-01-17T08:38:39Z

It seems that the only step of the update that fails is moving the old /bin and /externals to bin.2.163.1 and externals.2.163.1 and then creating the symlinks in their place.
Moving these directories and creating the symlinks manually, followed by a restart, also fixes the runner.
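The manual repair described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `relink_runner_dirs` and its parameters are hypothetical, and the version-suffixed directory names are taken from the description in this thread rather than from the runner's own update script.

```shell
# Sketch: move the old bin/externals aside and link the names to the new version.
# Arguments (all hypothetical): runner root, old version, new version.
relink_runner_dirs() {
    runner_root=$1
    old_version=$2
    new_version=$3
    for dir in bin externals; do
        current="$runner_root/$dir"
        # Skip when the new versioned directory is missing or the link already exists.
        if [ ! -d "$current.$new_version" ] || [ -L "$current" ]; then
            continue
        fi
        if [ -d "$current" ]; then
            # Refuse to overwrite an existing backup of the old version.
            if [ -e "$current.$old_version" ]; then
                return 1
            fi
            mv "$current" "$current.$old_version"
        fi
        # Relative link, so the runner root can be moved as a whole.
        ln -s "$dir.$new_version" "$current"
    done
}
```

It would be called with the runner stopped, for example `relink_runner_dirs /path/to/runner 2.163.1 2.164.0`, followed by a normal restart.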

TingluoHuang · 2020-01-19T03:06:37Z

@Alex-ala there should be an update log under the _diag folder in the runner root. The update log is named like Update-TIMESTAMP.succeed/failed. Can you share that with us?

Alex-ala · 2020-01-20T06:52:20Z

The log has neither .succeed nor .failed appended.
The content is:

[2020-01-17 08:33:15-5683] --------whoami--------
github-runner
[2020-01-17 08:33:15-5714] --------whoami--------
[2020-01-17 08:33:15-5724] Waiting for Runner.Listener (12219) to complete
[2020-01-17 08:33:15-5733] Process 12219 still running
[2020-01-17 08:33:16-5788] Process 12219 still running

No additional lines are written over time (I still have one runner that is currently not working on any job but wrote this log on the 20th of December).
Restarting that runner only makes it re-enter the update state and create a new SelfUpdate log file with the same contents.

In a fresh setup this happened:

1. Normal start with 2.163.1
2. Queued a job, which triggered an autoupdate
3. Runner creates the SelfUpdate log file with the above content
4. Runner seems to restart one second later with the old version

After step 4, the restarted runner's log file states that it is waiting for jobs, but it is stuck until I restart it again. It is still running 2.163.1 and the file system is as described in the issue. The PID given in the SelfUpdate log is no longer running after step 3.
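One way to check whether the PID named in the update log is still alive is sketched below. The helper names `listener_pid_from_log` and `pid_is_running` are hypothetical; the pattern is based only on the "Waiting for Runner.Listener (PID) to complete" line quoted above.

```shell
# Print the PID from the first "Waiting for Runner.Listener (PID) to complete"
# line of the given log file (prints nothing if no such line exists).
listener_pid_from_log() {
    sed -n 's/.*Waiting for Runner\.Listener (\([0-9][0-9]*\)) to complete.*/\1/p' "$1" | head -n 1
}

# Succeed if a signal could be sent to the PID. Note: this also fails for
# processes owned by another user, so run it as the runner's user or as root.
pid_is_running() {
    kill -0 "$1" 2>/dev/null
}
```

For example, `pid=$(listener_pid_from_log path/to/update.log)` followed by `pid_is_running "$pid" && echo "still running"` would show whether the update is waiting on a live process.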

TingluoHuang · 2020-01-21T03:33:10Z

@Alex-ala did you configure the runner as a service, or are you running it interactively from a terminal? If you run the runner interactively, try ps -e to see how many runner processes are actually running, then kill all runner processes and start the runner again.
I vaguely remember the self-update sometimes getting stuck in a weird state where an orphaned runner process somehow blocks normal runner execution.
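A small filter for spotting leftover runner processes in `ps` output might look like this. The function name is hypothetical; `Runner.Listener` appears in the update log in this thread, and `Runner.Worker` is included on the assumption that job workers run under that name.

```shell
# Read process listing lines on stdin and print only those that look like
# runner processes. Always exits 0, even when nothing matches.
list_runner_processes() {
    grep -E 'Runner\.(Listener|Worker)' || true
}

# Typical use (not executed here):
#   ps -e -o pid=,args= | list_runner_processes
```

An empty result after stopping the runner would suggest that no orphaned process is left behind.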

Alex-ala · 2020-01-21T06:02:14Z

In both cases, when stopped manually and during self-update, the runner has no process left running. There is a short period while the runner updates and restarts during which no processes are alive.
To fully answer your question: I have a systemd service wrapped around run.sh; stopping it stops all processes.

Alex-ala · 2020-01-21T06:22:01Z

Thank you for the hint about running it as a service, @TingluoHuang.
I checked our systemd service again; it was not set up with the template that ships with the runner. The service ran run.sh instead of bin/runsvc.sh, and that seems to cause the issue. Rewriting the systemd service to use runsvc.sh solves the issue.
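To check whether a hand-written unit has the same problem, a quick test like the following may help. It is a sketch: the function name and the unit file path in the usage line are hypothetical, and it only looks for `runsvc.sh` on an `ExecStart=` line.

```shell
# Succeed when an ExecStart line in the given unit file references runsvc.sh.
unit_uses_runsvc() {
    grep -Eq '^ExecStart=.*runsvc\.sh' "$1"
}

# Typical use (the path is an assumption):
#   unit_uses_runsvc /etc/systemd/system/github-runner.service || echo "unit starts run.sh directly?"
```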

Alex-ala added the bug label Jan 17, 2020

Alex-ala closed this as completed Jan 21, 2020

igorbrigadir mentioned this issue Jun 25, 2020

Runners become offline actions/actions-runner-controller#62

Closed
