Allow supervisor to recover after crash #519

Open · wants to merge 2 commits into main
Conversation

darinwilson

This is a first pass at a fix for #512: allowing Solid Queue to recover if the database goes offline (or if it fails for any other reason).

In the case of the database going away, there are a few possible scenarios that can cause the supervisor to fail, but the most common is the following chain (see the sketch after the list):

  • the worker tries and fails to poll the database
  • the worker tries to deregister (via after_shutdown in Registrable) and fails
  • Process#deregister re-raises any exceptions that come up during deregistration, so the worker crashes
  • the supervisor detects the crashed process and tries to replace it, but fails
  • the supervisor also inherits from Process, so when it fails, it calls deregister just like the worker did
  • nothing else is around to catch the exception raised in deregister, so Solid Queue terminates completely
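
In code terms, the crux of that chain is roughly the following (a simplified paraphrase of the behavior described above, not the actual library source):

```ruby
# Simplified paraphrase of the failure path described above; the real methods
# in SolidQueue::Process and SolidQueue::Processes::Registrable differ in detail.
class Process
  def deregister
    destroy_process_record          # hypothetical helper standing in for the DB call
  rescue StandardError => error
    raise error                     # per the chain above: deregistration errors are
  end                               # re-raised, crashing the calling process
end
```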

After a restart, the maintenance tasks performed by the supervisor do a good job of cleaning up the loose ends left behind, so it seemed like the cleanest approach was just to let the supervisor crash, then spin up a new instance. This is handled by a new Launcher class that wraps Supervisor#start in a retry block with exponential backoff.
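
To make that concrete, here is a minimal sketch of the idea; the method bodies, backoff formula, and rescue clause are illustrative assumptions rather than the PR's exact code (max_restart_attempts is the new config parameter described below):

```ruby
require "solid_queue"

# Minimal sketch of the Launcher idea: wrap the supervisor in a retry loop
# with exponential backoff. Details here are assumptions, not the PR's code.
module SolidQueue
  class Launcher
    def initialize(max_restart_attempts: 0)
      @max_restart_attempts = max_restart_attempts # nil = retry forever, 0 = never retry
    end

    def start
      attempts = 0
      begin
        Supervisor.start
      rescue StandardError => error
        attempts += 1
        raise if @max_restart_attempts && attempts > @max_restart_attempts

        wait = [2**attempts, 64].min # exponential backoff, capped at 64s
        SolidQueue.logger&.error("Supervisor crashed (#{error.class}); restarting in #{wait}s")
        sleep wait
        retry
      end
    end
  end
end
```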

This also adds a new config parameter max_restart_attempts that allows the user to limit the number of restart attempts. If nil, it will retry forever, and if 0, it won't retry at all. (I made 0 the default since that's the current behavior.)
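
Usage would then look roughly like this (hypothetical call sites; where the option is actually read from, e.g. the YAML config or an environment variable, depends on the final implementation):

```ruby
# Hypothetical call sites illustrating the max_restart_attempts semantics.
SolidQueue::Launcher.new(max_restart_attempts: 5).start    # give up after 5 failed restarts
SolidQueue::Launcher.new(max_restart_attempts: nil).start  # retry forever
SolidQueue::Launcher.new(max_restart_attempts: 0).start    # default: don't retry (current behavior)
```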

I tested with Postgres and MySQL, but didn't really know how to test SQLite or if it even made sense to. Again, this is just a first attempt - happy to try a different approach if this doesn't seem quite right.

Thanks!

@darinwilson (Author)

Looking into the test failures

@darinwilson (Author)

All checks but one are passing, and that one failure was due to an error from Docker in CI:

```
/usr/bin/docker create --name 067820fef0e2421091a40c8747b66a14_mysql8031_27b123 --label 42a304 --network github_network_4a20af9528964765a5e3250c78009207 --network-alias mysql -p 33060:3306 --health-cmd "mysql -h localhost -e \"select now()\"" --health-interval 1s --health-timeout 5s --health-retries 30 -e "MYSQL_ALLOW_EMPTY_PASSWORD=yes" -e GITHUB_ACTIONS=true -e CI=true mysql:8.0.31
b6e254956276fbbbdaa878d38f3c826cbef2afd738c7a383b6e6380450611489
/usr/bin/docker start b6e254956276fbbbdaa878d38f3c826cbef2afd738c7a383b6e6380450611489
Error response from daemon: driver failed programming external connectivity on endpoint 067820fef0e2421091a40c8747b66a14_mysql8031_27b123 (6238b6123635c6941643dbf6ac903990818d03ff85b58fd928b89f0bb53a9856): Error starting userland proxy: listen tcp4 0.0.0.0:33060: bind: address already in use
Error: failed to start containers: b6e254956276fbbbdaa878d38f3c826cbef2afd738c7a383b6e6380450611489
Error: Docker start fail with exit code 1
```

I tried to see if there was a way to re-run the job, but couldn't find anything. Am I missing something obvious? 😅
