Make crash loops resolve faster #2128

ssalinas · 2020-09-04T13:58:54Z

Currently we bucket failures into time windows. For example:

|O|O|X|X|O|O|O|O|

and we take into account the overall number of buckets that contain failures. However, one case I thought the code already handled well, but which it does not, is the very big difference between:

|X|X|O|O|O|O|O|O|

and

|O|O|O|O|O|X|X|O|

If you imagine each of those buckets is a 3-minute window, the overall window can stretch to 30 minutes. This PR adds an additional condition: the most recent failure timestamp must fall in the most recent X% of buckets (e.g. one of the failures must be in the most recent 8 minutes). This should cover cases where something failed for a while and then recovered on its own.
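As a rough illustration of the condition described above, here is a minimal Java sketch. It is not the code from `SingularityCrashLoops.java`: the class name `CrashLoopSketch`, the method `isCrashLooping`, and all of its parameters are hypothetical, and it assumes the rightmost bucket in the diagrams is the most recent one.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only; names and thresholds are assumptions, not Singularity's API.
class CrashLoopSketch {

  /**
   * Returns true when enough buckets contain a failure AND the newest
   * failure falls inside the most recent fraction of the buckets.
   */
  public static boolean isCrashLooping(
    List<Long> failureTimestamps,
    long now,
    long bucketSizeMillis,
    int numBuckets,
    double minFailedBucketFraction,
    double recentBucketFraction
  ) {
    long windowStart = now - bucketSizeMillis * numBuckets;
    boolean[] failed = new boolean[numBuckets];
    long newestFailure = Long.MIN_VALUE;

    for (long ts : failureTimestamps) {
      if (ts < windowStart || ts > now) {
        continue; // outside the overall window
      }
      // Index 0 is the oldest bucket, numBuckets - 1 the newest.
      int index = (int) Math.min(numBuckets - 1, (ts - windowStart) / bucketSizeMillis);
      failed[index] = true;
      newestFailure = Math.max(newestFailure, ts);
    }

    int failedBuckets = 0;
    for (boolean f : failed) {
      if (f) {
        failedBuckets++;
      }
    }

    // Existing-style condition: enough buckets overall contain failures.
    int requiredBuckets = (int) Math.ceil(minFailedBucketFraction * numBuckets);
    if (failedBuckets < requiredBuckets) {
      return false;
    }

    // Additional condition: the newest failure must be in the most recent X% of buckets.
    int recentBuckets = Math.max(1, (int) Math.ceil(recentBucketFraction * numBuckets));
    long recentCutoff = now - recentBuckets * bucketSizeMillis;
    return newestFailure >= recentCutoff;
  }

  public static void main(String[] args) {
    long minute = 60_000L;
    long now = 24 * minute; // 8 buckets of 3 minutes each

    // |X|X|O|O|O|O|O|O| : failures only in the two oldest buckets
    List<Long> recovered = Arrays.asList(1 * minute, 4 * minute);
    // |O|O|O|O|O|X|X|O| : failures in two of the newest buckets
    List<Long> ongoing = Arrays.asList(16 * minute, 19 * minute);

    System.out.println(isCrashLooping(recovered, now, 3 * minute, 8, 0.25, 0.34)); // false
    System.out.println(isCrashLooping(ongoing, now, 3 * minute, 8, 0.25, 0.34)); // true
  }
}
```

With these made-up thresholds, both patterns have the same number of failed buckets, but only the second has a failure within the most recent three buckets (9 minutes), so only it is still treated as a crash loop.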

mjball · 2020-09-04T14:06:25Z

Thank you!

SingularityService/src/main/java/com/hubspot/singularity/scheduler/SingularityCrashLoops.java

SingularityService/src/main/java/com/hubspot/singularity/config/CrashLoopConfiguration.java

pschoenfelder · 2020-09-04T15:00:26Z

lgtm 🚢

Make crash loops resolve faster

fe5a12c

ssalinas added the staging label (Merged to staging branch) Sep 4, 2020

mjball reviewed Sep 4, 2020

SingularityService/src/main/java/com/hubspot/singularity/scheduler/SingularityCrashLoops.java

use List for readability

044e7f3

pschoenfelder reviewed Sep 4, 2020

SingularityService/src/main/java/com/hubspot/singularity/config/CrashLoopConfiguration.java

Update CrashLoopConfiguration.java

1e5b782

ssalinas merged commit 6b551a0 into master Sep 4, 2020

ssalinas deleted the crash_loop_resolution_time branch September 4, 2020 15:00

ssalinas added this to the 1.3.0 milestone Sep 24, 2020
