-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
add warning when running out of disk. #873
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many customers have this issue, do we think? I'm wondering if there are runs that will be impacted and get this notation all the time and it will annoy those customers (maybe not).
Do we already have some telemetry in place that can tell us how frequently this happens so when we release this we can stop it from rolling further if it has unintended noise consequences?
src/Runner.Worker/JobExtension.cs
Outdated
@@ -325,6 +328,11 @@ public async Task<List<IStep>> InitializeJob(IExecutionContext jobContext, Pipel | |||
} | |||
} | |||
|
|||
if (jobContext.Global.Variables.GetBoolean("__DISK.CHECK") ?? true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this convention? Do we have other variables like this?
I think we should save this off as a constant somewhere and explain where it comes from in the code.
src/Runner.Worker/JobExtension.cs
Outdated
var workDirRoot = Directory.GetDirectoryRoot(HostContext.GetDirectory(WellKnownDirectory.Work)); | ||
var driveInfo = new DriveInfo(workDirRoot); | ||
var freeSpaceInMB = driveInfo.AvailableFreeSpace / 1024 / 1024; | ||
if (freeSpaceInMB < 100) // Add warning when disk is lower than 100 MB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make this configurable? Maybe instead of a boolean free disk check we have it be an int and then if the int exists and is non-zero we check for that number of MB/KB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way we can add telemetry so we can discern what portion of abandons are due to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue.Data[Constants.Runner.InternalTelemetryIssueDataKey] = Constants.Runner.LowDiskSpace;
will make this issue shows in Kusto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approved this but probably should merge/update some based on my suggestions.
86f6f87
to
a550df4
Compare
a550df4
to
b5d5812
Compare
Base on telemetry, lots of workflow run failed as the runner machine run out of disk.
When the machine runs out of disk, the runner may crash due to can't write to the log files, the OS may also freeze up.
The workflow run will look like hanging there do nothing and eventually gets abandoned by the service.
The PR tries to add a warning annotation to the job when there is less than 100MB left on the disk, to provide a hint to customers about why their job failed.