
Add healthcheck Agent Lifecycle Hook that runs between jobs #1111

Open
dbaggerman opened this issue Oct 18, 2019 · 6 comments
Labels
agent health (Relating to whether the agent is or should pull additional work), hook

Comments

@dbaggerman
Contributor

buildkite-agent has a disconnect-after-idle-timeout setting which causes it to shut down after an idle period.

It would be nice if there was an option to disconnect / shut down the agent if a healthcheck fails. We run an elastic autoscaled pool of agents, and shutting down in the case of a problem would allow the autoscaling group to replace unhealthy agents.

Currently, when an agent's disk fills up, for example, all jobs allocated to it fail, but it keeps accepting jobs regardless. These failures often require manual intervention to kill the instance and let the pool recover. Having the agent stop accepting jobs and shut down when available disk space is low would allow unhealthy agents to be replaced with clean ones without manual intervention.
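For reference, a minimal sketch of how the existing idle-based shutdown is configured (the timeout value and queue tag here are purely illustrative):

```bash
# Existing behaviour: the agent disconnects after being idle for the given
# number of seconds, which lets an autoscaling group scale the pool in.
buildkite-agent start \
  --disconnect-after-idle-timeout 300 \
  --tags "queue=elastic"
```

The request above is effectively the same disconnect behaviour, but triggered by a failing healthcheck rather than by idleness.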

@lox
Contributor

lox commented Oct 18, 2019

If sent a SIGTERM the agent will stop accepting new work but finish its current job. I'd be tempted to solve this with a cronjob that checks the health of the instance and, if it's unhealthy, gracefully terminates the agent and shuts down the instance 🤔
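A rough sketch of that cronjob approach, assuming a systemd-managed agent on Linux; the builds path and the 90% threshold are assumptions, not agent defaults:

```bash
#!/usr/bin/env bash
# Hypothetical once-a-minute cron healthcheck: if disk usage on the build
# volume crosses a threshold, stop the agent gracefully (SIGTERM lets it
# finish its current job) and schedule the instance for shutdown so the
# autoscaling group replaces it.
set -euo pipefail

THRESHOLD=90  # percent used, illustrative
USAGE=$(df --output=pcent /var/lib/buildkite-agent | tail -1 | tr -dc '0-9')

if [ "$USAGE" -ge "$THRESHOLD" ]; then
  # systemctl stop sends SIGTERM by default; use pkill -TERM buildkite-agent
  # instead if the agent is not managed by systemd.
  systemctl stop buildkite-agent
  shutdown -h +1 "buildkite-agent unhealthy: disk ${USAGE}% full"
fi
```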

@dbaggerman
Contributor Author

That is a good point. However, one concern I'd have with that is that cronjobs can only run once per minute, so an agent could accept and fail a number of jobs before the cronjob gets run. If the agent could run the healthcheck between jobs, that wouldn't be a problem.

The cronjob idea might be a reasonable workaround though, if you don't think it's appropriate to add it to the agent.

@dbaggerman
Contributor Author

To clarify a bit, the cronjob solution would work well for a preventative shutdown under normal/predictable use. That's useful but I'm also concerned about builds that create problems very quickly (we have some big builds that do this on a semi-regular basis).

A more general solution might be to add a post-exit hook that runs outside of a job. The hook could then SIGTERM the parent to achieve the result I'm looking for, while being flexible enough to have other uses.
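If such a hook existed, its body could be as small as the sketch below. The hook itself is hypothetical (the agent has no post-exit hook today), and the disk check mirrors the cron example above:

```bash
#!/usr/bin/env bash
# Hypothetical post-exit hook (not an existing agent hook): runs between
# jobs and asks the agent to stop gracefully when the host looks unhealthy.
set -euo pipefail

USAGE=$(df --output=pcent /var/lib/buildkite-agent | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge 90 ]; then
  # SIGTERM our parent (the agent), so it disconnects cleanly and the
  # autoscaling group replaces the instance.
  kill -TERM "$PPID"
fi
```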

@lox
Contributor

lox commented Oct 23, 2019

Yeah, post-exit is one I've wanted for quite a while. The other option we've been pondering is a pre-accept hook that runs before the agent accepts a job and provides an opportunity to decline it.

@zimbatm

zimbatm commented Jul 30, 2021

There is now a pre-bootstrap hook, introduced by #1456. I think it's intended to validate environment variables, but it could also be used to quickly clean the filesystem before accepting the next job.
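A sketch of how that could look, assuming a non-zero exit from the pre-bootstrap hook prevents the job from running; the cleanup path and disk threshold are illustrative, not defaults:

```bash
#!/usr/bin/env bash
# Agent pre-bootstrap hook sketch: reclaim space before each job, and
# refuse the job if the disk is still too full (a non-zero exit is
# understood to block the job from starting).
set -euo pipefail

# Illustrative cleanup of leftovers from previous builds.
rm -rf /var/lib/buildkite-agent/builds/*/tmp || true

# Fail (and thereby decline the job) if usage is still at or above 90%.
USAGE=$(df --output=pcent /var/lib/buildkite-agent | tail -1 | tr -dc '0-9')
[ "$USAGE" -lt 90 ]
```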

@keithduncan added the agent health label on Aug 13, 2021
@keithduncan changed the title from "Feature request: Disconnect health-check" to "Add healthcheck Agent Lifecycle Hook that runs between jobs" on Aug 30, 2021
@Silic0nS0ldier

Silic0nS0ldier commented Apr 21, 2022

Similar scope here, but I see #1363 was closed as a duplicate, so I'll mention it here.

On completion of an arbitrary job, it is possible for agent resources (e.g. databases, background services) to be left in a bad state, whether through implementation mistakes, timeouts, or errors. These problems are currently dealt with in pipelines as a safeguard against build flakes, but this solution is not without problems:

  1. It complicates job logic.
  2. It increases the time required to complete jobs.
  3. In rare cases issues can't be resolved, leading to job failures (and eventually retry logic, complicating things further).
  4. It increases the risk of timeouts (a type of flake which we are trying to avoid).

A better way to deal with this would be a "recovery" phase in which the agent can prepare for the next job (if there is one) without increasing agent wait times (since agents in the pool should all be clean and healthy).
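For context, a sketch of the kind of in-pipeline safeguard described above, which a between-jobs recovery phase could move off the critical path; the services and reset commands are purely illustrative:

```bash
#!/usr/bin/env bash
# Illustrative per-job cleanup currently run inside the pipeline (e.g. as a
# pre-command hook or a leading step), adding time to every job. A recovery
# phase between jobs could run the same steps without delaying the next job.
set -euo pipefail

# Reset shared services a previous job may have left in a bad state.
docker ps -q | xargs --no-run-if-empty docker kill
dropdb --if-exists test_db && createdb test_db
```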
