Make crash loops resolve faster #2128

Merged
ssalinas merged 3 commits into master from crash_loop_resolution_time on Sep 4, 2020

ssalinas (Member) commented Sep 4, 2020

Currently we bucket failures into time windows. For example:

|O|O|X|X|O|O|O|O|

and we take into account how many of the buckets contain failures. However, one thing I thought the code already compensated for, but it does not, is that there is a very big difference between:

|X|X|O|O|O|O|O|O|

and

|O|O|O|O|O|X|X|O|

If you imagine each of those buckets as a 3 minute window, the overall window can stretch to 30 minutes. This PR adds an additional condition: the most recent failure timestamp must fall within the most recent X% of buckets (e.g. at least one of the failures must be in the most recent 8 minutes). This should cover cases where something failed for a while and then recovered on its own.
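
To illustrate the idea, here is a minimal Python sketch of the two conditions described above. It is not the code from this PR: the function name, parameter names, and default thresholds are all hypothetical, and it assumes the leftmost bucket in the diagrams is the oldest.

```python
from typing import Sequence

MINUTE_MS = 60_000


def is_crash_looping(
    failure_times_ms: Sequence[int],
    now_ms: int,
    bucket_ms: int = 3 * MINUTE_MS,           # assumed bucket size (3 minutes)
    num_buckets: int = 8,                     # assumed number of buckets
    min_failed_bucket_fraction: float = 0.25, # hypothetical threshold
    recent_fraction: float = 0.33,            # hypothetical "most recent X%"
) -> bool:
    window_ms = bucket_ms * num_buckets
    window_start = now_ms - window_ms

    # Only failures inside the overall window are considered.
    in_window = [t for t in failure_times_ms if window_start <= t <= now_ms]
    if not in_window:
        return False

    # Existing condition: enough distinct buckets contain at least one failure.
    failed_buckets = {
        min((t - window_start) // bucket_ms, num_buckets - 1) for t in in_window
    }
    if len(failed_buckets) < min_failed_bucket_fraction * num_buckets:
        return False

    # Added condition: the most recent failure must fall within the most
    # recent fraction of the window, otherwise the task likely recovered.
    recent_cutoff = now_ms - recent_fraction * window_ms
    return max(in_window) >= recent_cutoff


now = 24 * MINUTE_MS
# |X|X|O|O|O|O|O|O|  failures only in the two oldest buckets
print(is_crash_looping([1 * MINUTE_MS, 4 * MINUTE_MS], now))    # False
# |O|O|O|O|O|X|X|O|  failures in two recent buckets
print(is_crash_looping([16 * MINUTE_MS, 19 * MINUTE_MS], now))  # True
```

Both inputs have the same number of failed buckets, so the bucket count alone cannot tell them apart; only the recency check separates the recovered case from the ongoing one.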

ssalinas added the staging label (Merged to staging branch) on Sep 4, 2020

mjball (Contributor) commented Sep 4, 2020

Thank you!

pschoenfelder (Contributor) commented

lgtm 🚢

ssalinas merged commit 6b551a0 into master on Sep 4, 2020
ssalinas deleted the crash_loop_resolution_time branch on September 4, 2020 15:00
ssalinas added this to the 1.3.0 milestone on Sep 24, 2020