Help trace worker crash in Kusto. #450

Merged 3 commits on Apr 28, 2020

TingluoHuang
Member

There was a bug in the Actions hosted runner virtual environment that reduced the available disk space: http://github.com/actions/virtual-environments/issues/709

Some customers' workflows also used up all the disk space, which caused the runner to crash as well.

We want to get an alert when the number of worker crashes increases, so we can proactively investigate these issues instead of waiting for customer reports.

Each worker crash means a failed workflow run and an unhappy customer.

@@ -952,8 +951,10 @@ private async Task LogWorkerProcessUnhandledException(Pipelines.AgentJobRequestM
ArgUtil.NotNull(timeline, nameof(timeline));
TimelineRecord jobRecord = timeline.Records.FirstOrDefault(x => x.Id == message.JobId && x.RecordType == "Job");
ArgUtil.NotNull(jobRecord, nameof(jobRecord));
var unhandledExceptionIssue = new Issue() { Type = IssueType.Error, Message = errorMessage };
unhandledExceptionIssue.Data[Constants.Runner.WorkerCrashIssueDataKey] = string.Empty;
Member Author

The service will check the issue data to decide whether to fire a product trace event.
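
For illustration, a minimal sketch of the marker-key idea described in this comment. Everything here is a stand-in: the `Issue` class is a simplified hypothetical type, the key value is a placeholder (the real one is defined by `Constants.Runner.WorkerCrashIssueDataKey`), and `ShouldFireTraceEvent` is an assumed name for whatever check the service performs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the runner's Issue type.
public sealed class Issue
{
    public string Message { get; set; } = string.Empty;

    // Case-insensitive keys are an assumption made for this sketch.
    public Dictionary<string, string> Data { get; } =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
}

public static class WorkerCrashTagging
{
    // Placeholder value; the real value of
    // Constants.Runner.WorkerCrashIssueDataKey is not shown in this diff.
    public const string WorkerCrashIssueDataKey = "worker_crash_placeholder";

    // Runner side: tag the error issue. Only the presence of the key
    // matters, so the value is left empty as in the diff.
    public static Issue CreateCrashIssue(string errorMessage)
    {
        var issue = new Issue { Message = errorMessage };
        issue.Data[WorkerCrashIssueDataKey] = string.Empty;
        return issue;
    }

    // Consumer side (assumed shape): fire a trace event only when the
    // marker key is present in the issue data.
    public static bool ShouldFireTraceEvent(Issue issue)
    {
        return issue.Data.ContainsKey(WorkerCrashIssueDataKey);
    }
}
```

Using key presence as the signal keeps the issue message free-form; the consumer never has to parse error text to recognize a crash.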

Collaborator

ericsciple, Apr 27, 2020

Think about:

Should we make this more generic? Do we have any other cases that warrant a more generic telemetry mechanism, e.g. [...].Data["_telemetry"] = "WORKER_CRASH"
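
A rough sketch of what the more generic variant suggested here could look like. The `_telemetry` key and the `WORKER_CRASH` value come from the comment above; the class and method names are hypothetical and not part of the runner.

```csharp
using System;
using System.Collections.Generic;

public static class TelemetryTagging
{
    public const string TelemetryKey = "_telemetry";   // key proposed in the review comment
    public const string WorkerCrash = "WORKER_CRASH";  // value proposed in the review comment

    // Tag issue data with a named telemetry event (hypothetical helper).
    public static void Tag(Dictionary<string, string> issueData, string eventName)
    {
        issueData[TelemetryKey] = eventName;
    }

    // Read the event name back; returns false when the key is missing
    // or the stored value is empty.
    public static bool TryGetTelemetryEvent(
        Dictionary<string, string> issueData, out string eventName)
    {
        return issueData.TryGetValue(TelemetryKey, out eventName)
            && !string.IsNullOrEmpty(eventName);
    }
}
```

With a single reserved key, a new event type only needs a new value; the consumer would dispatch on the value rather than checking for one key per event.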

@@ -486,7 +486,10 @@ public void ProcessCommand(IExecutionContext context, string inputLine, ActionCo

foreach (var property in command.Properties)
{
issue.Data[property.Key] = property.Value;
if (!string.Equals(property.Key, Constants.Runner.WorkerCrashIssueDataKey, StringComparison.OrdinalIgnoreCase))
Collaborator

Don't we already filter out everything other than specific properties?
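
For context, a sketch of the guard this hunk appears to add: properties supplied by a workflow command are copied into the issue data unless the key matches the reserved crash marker, so a workflow cannot fake a worker crash. The hunk is cut off after the `if`, so the loop body below is an assumption, and `ReservedKey` and `CopyUserProperties` are hypothetical names with a placeholder key value.

```csharp
using System;
using System.Collections.Generic;

public static class CommandPropertyFilter
{
    // Placeholder for Constants.Runner.WorkerCrashIssueDataKey.
    public const string ReservedKey = "worker_crash_placeholder";

    // Copy user-supplied command properties, skipping the reserved key
    // (compared case-insensitively, as in the diff).
    public static Dictionary<string, string> CopyUserProperties(
        IEnumerable<KeyValuePair<string, string>> properties)
    {
        var data = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var property in properties)
        {
            if (!string.Equals(property.Key, ReservedKey, StringComparison.OrdinalIgnoreCase))
            {
                data[property.Key] = property.Value;
            }
        }
        return data;
    }
}
```

If the command parser already passes through only an allow-list of property names, as the question above suggests, this guard would be redundant but harmless.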

TingluoHuang merged commit 70729fb into master Apr 28, 2020
TingluoHuang deleted the users/tihuang/traceworkercrash branch April 28, 2020 03:44
AdamOlech pushed a commit to antmicro/runner that referenced this pull request Jan 28, 2021
* Help trace worker crash in Kusto.

* more

* feedback.