Autoupdate from 2.163.1 to 2.164.0 stucks #289

Closed
Alex-ala opened this issue Jan 17, 2020 · 6 comments
Labels
bug Something isn't working

Comments

@Alex-ala

Describe the bug
We have had self-hosted runners deployed and running for some time. Today they started self-updating from version 2.163.1 to 2.164.0. The runner downloaded the new version and at least partially copied the new data. After the automatic restart, the runner no longer processes any jobs. Cancelling and restarting the job via the web UI, restarting the runner, and pushing a new commit all fail to unstick it.
For the runner to work again I had to:

  1. Remove all content in _work
  2. Let it run the update once again, so that it's stuck again
  3. Kill the runner
  4. Run _work/_update.sh as the runner's user
  5. Kill the runner that was started by _update.sh (kill the script itself and the runner)
  6. Restart the runner normally

I observed that, after the failed update, the folder structure contains some new folders suffixed with the new version. After a correct manual update, there are symlinks to these directories; the automatic update did not create them. Maybe this is the cause of the error.

To Reproduce
Steps to reproduce the behavior:

  1. Have a 2.163.1 runner running
  2. Start a job on that runner
  3. Watch it update and get stuck

Expected behavior
The runner updates correctly

Runner Version and Platform

Runner (before): 2.163.1
Runner (after): 2.164.0
OS: CentOS 7

What's not working?

Jobs are forever queued: "Starting your workflow run..."

Job Log Output

"Starting your workflow run..."

Runner and Worker's Diagnostic Logs

error.log
ls.txt

@Alex-ala Alex-ala added the bug Something isn't working label Jan 17, 2020
@Alex-ala
Author

It seems that the only step of the update that fails is moving the old /bin and /externals to bin.2.163.1 and externals.2.163.1 and then creating the symlinks in their place.
Moving these and creating the symlinks manually, followed by a restart, also fixes the runner.
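
The manual move-and-symlink fix described in this comment can be sketched roughly as follows. This is only an illustration, not the runner's own update logic: the function name `relink_runner_dirs` is hypothetical, and it assumes the updater has already unpacked `bin.<new>` and `externals.<new>` next to the live directories and that relative symlinks are acceptable.

```shell
#!/usr/bin/env bash
# Sketch of the manual fix; NOT the runner's actual update script.
# Assumption: bin.<new> and externals.<new> already exist in the runner root.
relink_runner_dirs() {
  local root="$1" old="$2" new="$3" dir

  # Check everything first so a failure leaves the tree untouched.
  for dir in bin externals; do
    [ -d "$root/$dir.$new" ] || { echo "missing $root/$dir.$new" >&2; return 1; }
    if [ -d "$root/$dir" ] && [ ! -L "$root/$dir" ] && [ -e "$root/$dir.$old" ]; then
      echo "$root/$dir.$old already exists" >&2
      return 1
    fi
  done

  for dir in bin externals; do
    # Move the real directory aside under its old version suffix.
    if [ -d "$root/$dir" ] && [ ! -L "$root/$dir" ]; then
      mv "$root/$dir" "$root/$dir.$old"
    fi
    # Point the unversioned name at the new version (relative link).
    ln -sfn "$dir.$new" "$root/$dir"
  done
}
```

With the runner stopped, it would be called as `relink_runner_dirs /path/to/runner 2.163.1 2.164.0`, where the path is a placeholder for the runner root.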

@TingluoHuang
Member

@Alex-ala there should be an update log under the _diag folder of the runner root. The update log is named like Update-TIMESTAMP.succeed/failed; can you share that with us?

@Alex-ala
Author

Alex-ala commented Jan 20, 2020

The log has neither .succeed nor .failed appended.
The content is

[2020-01-17 08:33:15-5683] --------whoami--------
github-runner
[2020-01-17 08:33:15-5714] --------whoami--------
[2020-01-17 08:33:15-5724] Waiting for Runner.Listener (12219) to complete
[2020-01-17 08:33:15-5733] Process 12219 still running
[2020-01-17 08:33:16-5788] Process 12219 still running

No additional lines are written over time (I still have one runner that is currently not working on any job but wrote this log on the 20th of December).
A restart of that runner only leads to it re-entering the update state and creating a new SelfUpdate log file with the same contents.
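
A quick way to spot update logs that never reached a final state, as described above, might look like the sketch below. The function name `find_unfinished_update_logs` is hypothetical, and the file-name pattern (`Update-` somewhere in the name, `.succeed` or `.failed` as the final suffix) is an assumption based on the maintainer's description, not a documented contract.

```shell
#!/usr/bin/env bash
# Print self-update logs that lack a final status suffix.
# Assumptions: logs live in <runner root>/_diag, contain "Update-" in
# their name, and finished runs end in .succeed or .failed.
find_unfinished_update_logs() {
  local diag="$1/_diag" f
  for f in "$diag"/*Update-*; do
    [ -e "$f" ] || continue      # glob matched nothing
    case "$f" in
      *.succeed|*.failed) ;;     # update reached a final state
      *) printf '%s\n' "$f" ;;   # still pending or stalled
    esac
  done
}
```

Any path it prints belongs to an update that started but neither completed nor reported a failure.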

In a fresh setup this happened:

  1. Normal start with 2.163.1
  2. Queued a job, which triggered an autoupdate
  3. Runner creates the SelfUpdate log file with the above content
  4. Runner seems to restart one second later with the old version

After step 4, the restarted runner's log file states that it is waiting for jobs, but it is stuck until I restart it again. It is still running 2.163.1 and the file system is as described in the issue. The PID given in the SelfUpdate log is no longer running after step 3.

@TingluoHuang
Member

@Alex-ala did you configure the runner as a service, or are you running it interactively from a terminal? If you run the runner interactively, try ps -e to see how many runner processes are actually running, then kill all runner processes and start the runner again.
I vaguely remember the self-update sometimes getting stuck in a weird state where an orphaned runner process somehow blocks normal runner execution.
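
As a rough sketch of the `ps -e` check suggested here, the filter below picks runner PIDs out of a process listing. The function name `runner_pids` is hypothetical, and the process names are an assumption: `Runner.Listener` appears in the update log in this thread, and `Runner.Worker` is assumed to be the job process.

```shell
#!/usr/bin/env bash
# Reads "PID COMMAND" lines on stdin (the format of `ps -e -o pid=,comm=`)
# and prints the PIDs of runner processes.
# Assumption: the processes are named Runner.Listener and Runner.Worker.
runner_pids() {
  awk '$2 == "Runner.Listener" || $2 == "Runner.Worker" { print $1 }'
}
```

Usage would be `ps -e -o pid=,comm= | runner_pids`; an empty result suggests no orphaned runner process is left behind.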

@Alex-ala
Author

Alex-ala commented Jan 21, 2020

In both cases, when manually stopped and during self-update, the runner has no process left running. There are no processes alive during a short period when the runner updates and restarts.
To fully answer your question: I have a systemd service wrapped around run.sh; stopping it stops all processes.

@Alex-ala
Author

Thank you for the hint about running it as a service, @TingluoHuang.
I checked our systemd service again; it was not set up with the template that ships with the runner. The service ran run.sh instead of bin/runsvc.sh, and that seems to cause the issue. Rewriting the systemd service to use runsvc.sh solves it.
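
To check an existing unit for the misconfiguration described above, something like the following sketch could be used. Both function names are hypothetical, and the parser assumes the unit file has a single `ExecStart=` line whose first word is a plain script path.

```shell
#!/usr/bin/env bash
# Print the file name of the script a systemd unit launches.
# Assumption: one ExecStart= line whose first word is the script path.
unit_entrypoint() {
  awk -F= '/^ExecStart=/ {
    split($2, words, " ")          # first word is the executable path
    n = split(words[1], parts, "/")
    print parts[n]                 # keep only the file name
    exit
  }' "$1"
}

# Report whether the unit uses the service wrapper named in this thread.
check_runner_unit() {
  case "$(unit_entrypoint "$1")" in
    runsvc.sh) echo "ok: unit starts runsvc.sh" ;;
    run.sh)    echo "warning: unit starts run.sh directly" ;;
    *)         echo "unknown entry point" ;;
  esac
}
```

It takes the path to the unit file as its only argument. A "warning" result matches the setup that got stuck in this issue.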


2 participants