
Not enough Inodes #961

Closed
darrenwhighamfd opened this issue Nov 16, 2021 · 4 comments
Labels
agent health, custom-configuration

Comments


darrenwhighamfd commented Nov 16, 2021

Hi,

We are starting to see this error more and more on our busier queues used for building:

Checking docker
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
Checking disk space
Disk space free: 8.2G
Inodes free: 152K
Not enough inodes free, cutoff is 250000 🚨
Cleaning up docker resources older than 4h
Total reclaimed space: 0B
Checking disk space again
Disk space free: 8.2G
Inodes free: 152K
Not enough inodes free, cutoff is 250000 🚨
Disk health checks failed
🚨 Error: Error setting up bootstrap: The global environment hook exited with status 1

I see the script for this check here: https://github.com/buildkite/elastic-ci-stack-for-aws/blob/master/packer/linux/conf/bin/bk-check-disk-space.sh, and the cleanup command:

docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-4h}"

However, our agents are typically shorter-lived than 4 hours because we scale as needed, so the clean up does not help unless we set this to a lower value. Is there another way around this issue, other than adding more disk space to the agent (which I think increases the inodes available) or reducing the agent life and spinning instances down sooner?
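
For context, the failing check boils down to comparing free inodes on the Docker volume against a fixed cutoff. A minimal sketch of that logic (paraphrased, not the verbatim bk-check-disk-space.sh; the DISK_MIN_INODES variable name is an assumption, while the 250000 default matches the cutoff in the log above):

#!/usr/bin/env bash
# Paraphrased sketch of the inode portion of the disk health check.
# DISK_MIN_INODES is an assumed name; 250000 matches the logged cutoff.
set -euo pipefail

DISK_MIN_INODES="${DISK_MIN_INODES:-250000}"

# Free inodes on the filesystem backing /var/lib/docker (GNU df).
inodes_free=$(df --output=iavail /var/lib/docker | tail -n 1 | tr -d ' ')
echo "Inodes free: ${inodes_free}"

if [ "${inodes_free}" -lt "${DISK_MIN_INODES}" ]; then
  echo "Not enough inodes free, cutoff is ${DISK_MIN_INODES} 🚨"
  exit 1
fi

Because ext4 filesystems fix their inode count at creation time, a volume can run out of inodes while gigabytes of space remain free, which is exactly what the log above shows.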

darrenwhighamfd (Author) commented

I see an issue was raised in 2018 to allow setting the values for the inode check and clean up time:
#465

Not sure if this is something to revisit, but I wanted to highlight it.

keithduncan added the agent health and custom-configuration labels Nov 16, 2021
keithduncan (Contributor) commented

Thank you for opening this issue @darrenwhighamfd.

To confirm my understanding is correct: once these instances’ disks have filled up to this point, do they fail any builds they are assigned until they are replaced?

Medium to long term, I would like to move this health check out of the job lifecycle and into the agent lifecycle, so that agents whose host is unhealthy do not accept jobs.

In the short term, to get this working for you again, would you be able to append a value for the DOCKER_PRUNE_UNTIL environment variable to the /var/lib/buildkite-agent/cfn-env file using a script passed to the template’s BootstrapScriptUrl parameter?
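
For reference, a minimal sketch of such a bootstrap script (hosted at the URL you pass to BootstrapScriptUrl; it assumes cfn-env is sourced as a shell file, and the 1h value is only an example to tune against your instance lifetime):

#!/usr/bin/env bash
# Sketch of a BootstrapScriptUrl script that shortens the Docker prune
# window. Assumes /var/lib/buildkite-agent/cfn-env is sourced as shell,
# so an appended export line is visible to the disk health check.
set -euo pipefail

echo 'export DOCKER_PRUNE_UNTIL=1h' >> /var/lib/buildkite-agent/cfn-env

With that in place, the cleanup step becomes docker image prune --all --force --filter "until=1h", which should reclaim images from earlier jobs even on short-lived instances.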

darrenwhighamfd (Author) commented

Thanks @keithduncan. That’s correct about the issue; we have set DOCKER_PRUNE_UNTIL to a lower value for now to see if it helps mitigate it.

keithduncan (Contributor) commented

Good to hear @darrenwhighamfd. I’m going to close this in the short term, as we have hopefully mitigated the acute issue and have the long-term fix tracked in buildkite/agent#1111.

If this recurs and our mitigation proves insufficient, please don’t hesitate to re-open or leave a comment and we’ll work on a new solution. 🙇
