Issues with the low-disk cronjob/hook #742

Closed
ecktom opened this issue Oct 13, 2020 · 2 comments · Fixed by #898
Labels
agent health Relating to whether the agent is or should pull additional work

Comments


ecktom commented Oct 13, 2020

Hi,

I just noticed some issues when a node runs out of disk space. AFAIK there are two places which are supposed to handle that situation:

  1. https://github.com/buildkite/elastic-ci-stack-for-aws/blob/master/packer/linux/conf/bin/bk-check-disk-space.sh
    The script itself works correctly and the failing environment hook leads to a failed build. Unlike with the cronjob, however, the node is neither marked as unhealthy nor is the buildkite-agent killed. So any retry, or any other build scheduled on the same node, keeps failing until the hourly cronjob eventually kicks in, does some GC, or drops the node.

  2. https://github.com/buildkite/elastic-ci-stack-for-aws/blob/master/packer/linux/conf/docker/cron.hourly/docker-low-disk-gc
    mark_instance_unhealthy() works via an ERR trap. Unless I'm overlooking something, however, there is nothing in

if ! /usr/local/bin/bk-check-disk-space.sh ; then
  echo "Cleaning up docker resources older than ${DOCKER_PRUNE_UNTIL}"
  docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}"
  docker builder prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}"

  if ! /usr/local/bin/bk-check-disk-space.sh ; then
    echo "Disk health checks failed" >&2
    exit 1
  fi
fi

that would actually trigger this trap ;) So we do the docker image prune and docker builder prune, but if the subsequent disk check still fails we only log to stderr and exit 1, which does not fire the ERR trap and leaves the node alive. I quickly adjusted the line to echo "Disk health checks failed" >&2 && false, which does trigger the trap (see the sketch below) and brings me to the next issue ;)
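For illustration, a minimal self-contained sketch of the trap behaviour described above (the trap wiring and function body here are assumptions for the demo, not the script's exact contents):

#!/usr/bin/env bash
set -euo pipefail

mark_instance_unhealthy() {
  # Placeholder; the real script calls `aws autoscaling set-instance-health` here.
  echo "marking instance unhealthy" >&2
}

trap mark_instance_unhealthy ERR

# A plain `exit 1` terminates the script but does NOT fire the ERR trap,
# so the instance would never be marked unhealthy:
#   exit 1

# A failing command does fire the ERR trap, which is why appending
# `&& false` to the error message makes the cleanup-failed path work:
echo "Disk health checks failed" >&2 && false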

That next issue: the aws autoscaling set-instance-health command fails with:

Marking instance as unhealthy
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    19  100    19    0     0   9500      0 --:--:-- --:--:-- --:--:--  9500
You must specify a region. You can also configure your region by running "aws configure".

A quick check with aws configure list indeed showed that there is no default region configured.
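One possible way to address the missing region (a sketch only, assuming the EC2 instance metadata service is reachable; the stack's actual fix may differ) is to derive it from instance metadata and pass it explicitly:

# Fetch instance id and region from the EC2 instance metadata service
INSTANCE_ID=$(curl -fsS http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -fsS http://169.254.169.254/latest/meta-data/placement/region)

aws autoscaling set-instance-health \
  --region "${REGION}" \
  --instance-id "${INSTANCE_ID}" \
  --health-status Unhealthy

Alternatively, exporting AWS_DEFAULT_REGION before invoking the AWS CLI would achieve the same.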

keithduncan added the agent health label on Aug 13, 2021

ecktom commented Aug 31, 2021

@keithduncan Many thanks for jumping in, this issue has been pretty annoying for quite a while. I've looked at #898 and just noticed that point 1) hasn't been addressed.
So the issue still exists, at least partially, e.g.:

13:00 - Cronjob runs - everything fine
13:05 - A build exhausts the disk space by e.g. building/pulling some docker images
Until 14:00 - every other build that gets scheduled to that node simply fails fast (a retry will schedule it to the same node)
14:00 - Cronjob cleans up the node

keithduncan (Contributor) commented

You're right @ecktom! My plan to address that is to add a new Agent Lifecycle Hook which runs between jobs, so that the agent checks whether it should poll for new jobs before accepting one that is likely to fail 😄 I should have commented to that effect in my pull request.
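For a rough idea of what such a between-jobs check could look like (a sketch only; it assumes an agent hook that can reject work by exiting non-zero, e.g. a pre-bootstrap-style hook, and is not the implementation from #898):

#!/usr/bin/env bash
set -euo pipefail

# Re-use the existing disk check; a non-zero exit tells the agent
# not to accept the job instead of running it into a certain failure.
if ! /usr/local/bin/bk-check-disk-space.sh ; then
  echo "Insufficient disk space, refusing to take the job" >&2
  exit 1
fi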
