Add healthcheck
Agent Lifecycle Hook that runs between jobs
#1111
Comments
If sent a |
That is a good point. However, one concern I'd have is that cron jobs can run at most once per minute. An agent could accept and fail a number of jobs before the cron job next runs. If the agent could run the healthcheck between jobs, that wouldn't be a problem. The cron job idea might be a reasonable workaround though, if you don't think it's appropriate to add this to the agent.
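To make the between-jobs idea concrete, here is a minimal sketch. It is not buildkite-agent code: `run_jobs`, `jobs`, and `healthcheck` are hypothetical names used only for illustration. The point is that the check runs before every job is accepted, so an agent that breaks during one job stops before taking the next, rather than waiting for a once-a-minute cron job.

```python
def run_jobs(jobs, healthcheck):
    """Run jobs in order, calling healthcheck() before accepting each one.

    Hypothetical sketch: `jobs` is a list of zero-argument callables and
    `healthcheck` is a zero-argument callable returning True when healthy.
    Returns the results of the jobs that were actually run.
    """
    completed = []
    for job in jobs:
        # Stop accepting work as soon as the agent is unhealthy.
        if not healthcheck():
            break
        completed.append(job())
    return completed
```

If the second job in a queue of four leaves the agent unhealthy, only the first two jobs run; the remaining two are never accepted.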
To clarify a bit, the cronjob solution would work well for a preventative shutdown under normal/predictable use. That's useful but I'm also concerned about builds that create problems very quickly (we have some big builds that do this on a semi-regular basis). A more general solution might be to add a |
Yeah, |
There is now a |
Similar scope here, but I see #1363 was closed as a duplicate so I'll mention it here. On completion of an arbitrary job, it is possible for agent resources (e.g. databases, background services) to be left in a bad state, whether through implementation mistakes, timeouts, or errors. These problems are currently dealt with in pipelines as a safeguard against build flakes, but this solution is not without problems.
A better way to deal with this would be a "recovery" phase in which the agent can prepare for the next job (if there is one) without increasing agent wait times (since agents in the pool should all be clean and healthy).
buildkite-agent has a disconnect-after-idle-timeout setting which causes it to shut down after an idle period.
It would be nice if there was an option to disconnect / shut down the agent if a healthcheck fails. We run an elastic autoscaled pool of agents, and shutting down in the case of a problem would allow the autoscaling group to replace unhealthy agents.
Currently, when an agent's disk fills up, for example, all jobs allocated to it fail, but it keeps accepting jobs regardless. This often requires manual intervention to kill the instance and let the pool recover. Having the agent stop accepting jobs and shut down when available disk space is low would allow it to be replaced with a clean agent without manual intervention.
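As a rough illustration of the kind of check I have in mind, here is a minimal sketch of a disk-space healthcheck. The function names and the 10% threshold are my own assumptions, not anything the agent provides today; how the agent would invoke such a check is exactly what this issue is asking for.

```python
import shutil


def disk_is_healthy(path="/", min_free_fraction=0.10):
    """Return True if at least min_free_fraction of the disk at path is free.

    Hypothetical helper: the default threshold of 10% is an assumption.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Treat an unreadable or zero-sized filesystem as unhealthy.
        return False
    return usage.free / usage.total >= min_free_fraction


def healthcheck_exit_code(path="/", min_free_fraction=0.10):
    """Map the check to a process exit code: 0 for healthy, 1 for unhealthy."""
    return 0 if disk_is_healthy(path, min_free_fraction) else 1
```

A script wrapping `healthcheck_exit_code` could exit non-zero when space is low, and the agent (or a cron job, as a workaround) could treat that as the signal to stop accepting jobs and shut down.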