
Not enough Inodes #961

Closed
darrenwhighamfd opened this issue Nov 16, 2021 · 4 comments
Labels
agent health, custom-configuration

Comments


darrenwhighamfd commented Nov 16, 2021

Hi,

We are starting to see this error more and more on our busier queues used for building:

Checking docker
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
Checking disk space
Disk space free: 8.2G
Inodes free: 152K
Not enough inodes free, cutoff is 250000 🚨
Cleaning up docker resources older than 4h
Total reclaimed space: 0B
Checking disk space again
Disk space free: 8.2G
Inodes free: 152K
Not enough inodes free, cutoff is 250000 🚨
Disk health checks failed
🚨 Error: Error setting up bootstrap: The global environment hook exited with status 1

I see the script for this check here: https://github.com/buildkite/elastic-ci-stack-for-aws/blob/master/packer/linux/conf/bin/bk-check-disk-space.sh, and the cleanup command:

docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-4h}"

However, our agents are typically shorter-lived than 4 hours because we scale as needed, so the clean up does not help unless we set this to a lower value. Is there another way around this issue, other than adding more disk space to the agent (which I think increases the inodes available) or reducing the agent life and spinning instances down sooner?
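
For context, the failing check boils down to comparing free inodes on the Docker volume against a fixed cutoff. A minimal sketch of that logic (paraphrased, not the verbatim bk-check-disk-space.sh; the DISK_MIN_INODES variable name is an assumption, while the 250000 default matches the cutoff in the log above):

#!/usr/bin/env bash
# Paraphrased sketch of the inode portion of the disk health check.
# DISK_MIN_INODES is an assumed name; 250000 matches the logged cutoff.
set -euo pipefail

DISK_MIN_INODES="${DISK_MIN_INODES:-250000}"

# Free inodes on the filesystem backing /var/lib/docker (GNU df).
inodes_free=$(df --output=iavail /var/lib/docker | tail -n 1 | tr -d ' ')
echo "Inodes free: ${inodes_free}"

if [ "${inodes_free}" -lt "${DISK_MIN_INODES}" ]; then
  echo "Not enough inodes free, cutoff is ${DISK_MIN_INODES} 🚨"
  exit 1
fi

Because ext4 filesystems fix their inode count at creation time, a volume can run out of inodes while gigabytes of space remain free, which is exactly what the log above shows.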

darrenwhighamfd (Author) commented

I see an issue was raised in 2018 to allow setting the values for the inode check and clean up time:
#465

Not sure if this is something to revisit, but I wanted to highlight it.

keithduncan added the agent health and custom-configuration labels Nov 16, 2021
keithduncan (Contributor) commented

Thank you for opening this issue @darrenwhighamfd.

To confirm my understanding is correct: once these instances’ disks have filled up to this point, do they fail any builds they are assigned until they are replaced?

Medium to long term, I would like to move this health check out of the job lifecycle and into the agent lifecycle, so that agents whose host is unhealthy do not accept jobs.

In the short term, to get this working for you again, would you be able to append a value for the DOCKER_PRUNE_UNTIL environment variable to the /var/lib/buildkite-agent/cfn-env file using a script passed to the template’s BootstrapScriptUrl parameter?
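
For reference, a minimal sketch of such a bootstrap script (hosted at the URL you pass to BootstrapScriptUrl; it assumes cfn-env is sourced as a shell file, and the 1h value is only an example to tune against your instance lifetime):

#!/usr/bin/env bash
# Sketch of a BootstrapScriptUrl script that shortens the Docker prune
# window. Assumes /var/lib/buildkite-agent/cfn-env is sourced as shell,
# so an appended export line is visible to the disk health check.
set -euo pipefail

echo 'export DOCKER_PRUNE_UNTIL=1h' >> /var/lib/buildkite-agent/cfn-env

With that in place, the cleanup step becomes docker image prune --all --force --filter "until=1h", which should reclaim images from earlier jobs even on short-lived instances.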

darrenwhighamfd (Author) commented

Thanks @keithduncan. That’s correct about the issue; we have set DOCKER_PRUNE_UNTIL to a lower value for now to see if it helps mitigate it.

keithduncan (Contributor) commented

Good to hear @darrenwhighamfd. I’m going to close this in the short term, as we have hopefully mitigated the acute issue and have the long-term fix tracked in buildkite/agent#1111.

If this recurs and our mitigation proves insufficient, please don’t hesitate to re-open or leave a comment and we’ll work on a new solution. 🙇
