Build does not fail in the pipeline upload stage when there is no available disk space in agent's machine. #1105

Closed
bilalm19 opened this issue Oct 10, 2019 · 2 comments


bilalm19 commented Oct 10, 2019

I was testing what error Buildkite shows when the agent's disk has no free space. When the disk fills up during the build (in steps that run after Buildkite has uploaded the pipeline.yml file), it shows:

🚨 Error: Error creating hook script: write /tmp/buildkite-agent-bootstrap-hook-runner-584912760: no space left on device
🚨 Error: Error tearing down bootstrap: write /tmp/buildkite-agent-bootstrap-hook-runner-584912760: no space left on device

This is as expected.

But when I start another build while the agent's machine has no free disk space, the build does not seem to fail (I waited about an hour before cancelling it). The build is stuck at the buildkite-agent pipeline upload .buildkite/pipeline.yml stage. The agent keeps trying to accept the job, as shown by the logs from journalctl -f -u buildkite-agent:

-- Logs begin at Wed 2019-10-09 07:44:19 UTC. --
Oct 10 12:29:15 agent-machine buildkite-agent[2564]: 2019-10-10 12:29:15 INFO   agent-machine-1 Assigned job f1985fac-f458-4628-8ab1-73ddc3615531. Accepting...
Oct 10 12:29:15 agent-machine buildkite-agent[2564]: 2019-10-10 12:29:15 ERROR  agent-machine-1 Failed to initialize job: open /tmp/job-env-f1985fac-f458-4628-8ab1-73ddc3615531916098660: no space left on device
Oct 10 12:29:17 agent-machine buildkite-agent[2564]: 2019-10-10 12:29:17 INFO   agent-machine-1 Assigned job f1985fac-f458-4628-8ab1-73ddc3615531. Accepting...
Oct 10 12:29:17 agent-machine buildkite-agent[2564]: 2019-10-10 12:29:17 ERROR  agent-machine-1 Failed to initialize job: open /tmp/job-env-f1985fac-f458-4628-8ab1-73ddc3615531961538675: no space left on device
Oct 10 12:29:19 agent-machine buildkite-agent[2564]: 2019-10-10 12:29:19 INFO   agent-machine-1 Assigned job f1985fac-f458-4628-8ab1-73ddc3615531. Accepting...
Oct 10 12:29:19 agent-machine buildkite-agent[2564]: 2019-10-10 12:29:19 ERROR  agent-machine-1 Failed to initialize job: open /tmp/job-env-f1985fac-f458-4628-8ab1-73ddc3615531707490614: no space left on device
Oct 10 12:29:21 agent-machine buildkite-agent[2564]: 2019-10-10 12:29:21 INFO   agent-machine-1 Assigned job f1985fac-f458-4628-8ab1-73ddc3615531. Accepting...
Oct 10 12:29:22 agent-machine buildkite-agent[2564]: 2019-10-10 12:29:22 ERROR  agent-machine-1 Failed to initialize job: open /tmp/job-env-f1985fac-f458-4628-8ab1-73ddc3615531040249629: no space left on device
Oct 10 12:29:23 agent-machine buildkite-agent[2564]: 2019-10-10 12:29:23 INFO   agent-machine-1 Assigned job f1985fac-f458-4628-8ab1-73ddc3615531. Accepting...
Oct 10 12:29:23 agent-machine buildkite-agent[2564]: 2019-10-10 12:29:23 ERROR  agent-machine-1 Failed to initialize job: open /tmp/job-env-f1985fac-f458-4628-8ab1-73ddc3615531375800280: no space left on device

The failure to initialize the job is not shown in the build log, which just keeps waiting for the agent, and the build never exits with an error.

Contributor

lox commented Oct 16, 2019

Sorry you ran into this! Running out of disk and memory are really tough problems to detect and recover from, I'm afraid.

I think we probably need to fail an agent after it's failed to accept a certain number of jobs 🤔
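
The idea of failing an agent after repeated accept failures could be pictured as a consecutive-failure counter. This is only an illustrative Python sketch, not the agent's actual code: `AcceptFailureTracker`, `max_consecutive_failures`, and the threshold of 3 are all hypothetical names and values.

```python
class AcceptFailureTracker:
    """Counts consecutive job-accept failures (hypothetical helper)."""

    def __init__(self, max_consecutive_failures=3):
        # Assumed threshold; a real agent would likely make this configurable.
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, accepted):
        """Record one accept attempt; return True if the agent should stop."""
        if accepted:
            # Any successful accept resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_consecutive_failures


tracker = AcceptFailureTracker(max_consecutive_failures=3)
# Simulated outcomes: each False stands for one "Failed to initialize job" line.
outcomes = [False, False, False]
should_stop = [tracker.record(ok) for ok in outcomes]
```

With a counter like this, the retry loop in the logs above would end after the third failed accept instead of continuing indefinitely.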

@keithduncan
Contributor

Thanks for opening this issue! I’m going to close this one in favour of #1111 which proposes an inter-job healthcheck hook that could be used to perform inline health checks and prevent unhealthy agents from accepting further jobs. 🙇
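
For the disk-space case reported here, an inline health check might look something like the sketch below. It is an assumption about what such a check could do, not the interface proposed in #1111: `has_free_space`, the 100 MiB threshold, and the choice of the temp directory are hypothetical.

```python
import shutil
import tempfile


def has_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` has enough free space."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= min_free_bytes


# Hypothetical threshold: require 100 MiB free where temp files are written.
MIN_FREE_BYTES = 100 * 1024 * 1024
healthy = has_free_space(tempfile.gettempdir(), MIN_FREE_BYTES)
```

A check of this kind, run between jobs, would let an agent with a full disk report itself unhealthy rather than repeatedly failing to create its job-env file.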
