Issues with the low-disk cronjob/hook #742

Closed
ecktom opened this issue Oct 13, 2020 · 2 comments · Fixed by #898
Labels
agent health Relating to whether the agent is or should pull additional work

Comments


ecktom commented Oct 13, 2020

Hi,

I just noticed some issues when a node runs out of disk space. AFAIK there are two places which are supposed to handle that situation:

  1. https://github.com/buildkite/elastic-ci-stack-for-aws/blob/master/packer/linux/conf/bin/bk-check-disk-space.sh
    The script itself works correctly and the failing environment hook leads to a failed build. Unlike with the cronjob, however, the node is neither marked as unhealthy nor is the buildkite-agent killed. So any retry, or any other build scheduled on the same node, keeps failing until the hourly cronjob eventually kicks in, does some GC, or drops the node.

  2. https://github.com/buildkite/elastic-ci-stack-for-aws/blob/master/packer/linux/conf/docker/cron.hourly/docker-low-disk-gc
    mark_instance_unhealthy() works via an ERR trap. Unless I'm overlooking something, however, there is nothing in

if ! /usr/local/bin/bk-check-disk-space.sh ; then
  echo "Cleaning up docker resources older than ${DOCKER_PRUNE_UNTIL}"
  docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}"
  docker builder prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}"

  if ! /usr/local/bin/bk-check-disk-space.sh ; then
    echo "Disk health checks failed" >&2
    exit 1
  fi
fi

that would actually trigger this trap ;) So we do the docker image prune and docker builder prune, but if the subsequent disk check still fails we only log to stderr and exit 1, which does not fire the ERR trap and leaves the node alive. I quickly adjusted the line to echo "Disk health checks failed" >&2 && false, which does trigger the trap (see the sketch below) and brings me to the next issue ;)
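For illustration, a minimal self-contained sketch of the trap behaviour described above (the trap wiring and function body here are assumptions for the demo, not the script's exact contents):

#!/usr/bin/env bash
set -euo pipefail

mark_instance_unhealthy() {
  # Placeholder; the real script calls `aws autoscaling set-instance-health` here.
  echo "marking instance unhealthy" >&2
}

trap mark_instance_unhealthy ERR

# A plain `exit 1` terminates the script but does NOT fire the ERR trap,
# so the instance would never be marked unhealthy:
#   exit 1

# A failing command does fire the ERR trap, which is why appending
# `&& false` to the error message makes the cleanup-failed path work:
echo "Disk health checks failed" >&2 && false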

That next issue: the aws autoscaling set-instance-health command fails with:

Marking instance as unhealthy
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    19  100    19    0     0   9500      0 --:--:-- --:--:-- --:--:--  9500
You must specify a region. You can also configure your region by running "aws configure".

A quick check with aws configure list indeed showed that there is no default region configured.
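One possible way to address the missing region (a sketch only, assuming the EC2 instance metadata service is reachable; the stack's actual fix may differ) is to derive it from instance metadata and pass it explicitly:

# Fetch instance id and region from the EC2 instance metadata service
INSTANCE_ID=$(curl -fsS http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -fsS http://169.254.169.254/latest/meta-data/placement/region)

aws autoscaling set-instance-health \
  --region "${REGION}" \
  --instance-id "${INSTANCE_ID}" \
  --health-status Unhealthy

Alternatively, exporting AWS_DEFAULT_REGION before invoking the AWS CLI would achieve the same.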

keithduncan added the agent health label on Aug 13, 2021

ecktom commented Aug 31, 2021

@keithduncan Many thanks for jumping in, this issue has been pretty annoying for quite a while. I've looked at #898 and just noticed that point 1) hasn't been addressed.
So the issue still exists, at least partially, e.g.:

13:00 - Cronjob runs - everything fine
13:05 - A build exhausts the disk space by e.g. building/pulling some docker images
Until 14:00 - every other build that gets scheduled to that node simply fails fast (a retry will schedule it to the same node)
14:00 - Cronjob cleans up the node

keithduncan (Contributor) commented

You're right @ecktom! My plan to address that is to add a new Agent Lifecycle Hook which runs between jobs, so that the agent checks whether it should poll for new jobs before accepting one that is likely to fail 😄 I should have commented to that effect in my pull request.
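For a rough idea of what such a between-jobs check could look like (a sketch only; it assumes an agent hook that can reject work by exiting non-zero, e.g. a pre-bootstrap-style hook, and is not the implementation from #898):

#!/usr/bin/env bash
set -euo pipefail

# Re-use the existing disk check; a non-zero exit tells the agent
# not to accept the job instead of running it into a certain failure.
if ! /usr/local/bin/bk-check-disk-space.sh ; then
  echo "Insufficient disk space, refusing to take the job" >&2
  exit 1
fi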
