Pod-startup regression due to probing periods #10973

Closed
markusthoemmes opened this issue Mar 18, 2021 · 8 comments · Fixed by #10992

Labels: area/autoscale · kind/bug (Categorizes issue or PR as related to a bug) · triage/accepted (Issues which should be fixed, post-triage)

@markusthoemmes (Contributor):

I've been doing some investigation into pod startup times, since that's something the autoscaling WG wants to optimize, and that made me stumble into wildly varying pod-startup time data. (Note: all data was captured using https://github.com/markusthoemmes/podspeed, so I might totally have a bug there.)

v0.21 and prior

Our assumption was that the exec readiness probe would only start running after the container had been started. That assumption is wrong. Generally, the probes run on their own timer, asynchronously to pod startup (until recently, but more on that later). The readiness probe's period was bumped to 10s in #8147. If the container is not yet started when the timer ticks first, it'll only tick again 10s later, making for... well... very slow startup times:
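
For illustration only (this is not the actual Knative-generated container spec): a minimal sketch of a container with an exec readiness probe on a 10-second period, using k8s.io/api/core/v1 types. The field names assume a k8s.io/api version (v0.22+) where the probe handler is embedded as `ProbeHandler`; the image and probe command are placeholders.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// Illustrative only: an exec readiness probe that the kubelet runs on its own
// 10s timer, independent of when the container actually starts. If the first
// tick fires before the container is up, the pod cannot become Ready until
// the next tick, a full PeriodSeconds later.
var userContainer = corev1.Container{
	Name:  "user-container",
	Image: "example.dev/app:latest", // placeholder image
	ReadinessProbe: &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"/probe"}, // placeholder probe binary
			},
		},
		PeriodSeconds: 10, // the period bumped in #8147
	},
}
```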

HEAD

@julz swapped us over to using a startup probe that runs on a tight-ish timer. That in theory fixes the above, but the readiness probe is still running on a 10s period. Even with the startup probe passing, the readiness probe is required to mark the container Ready. K8s v1.21 has a fix that kicks off the readiness probe as soon as the startup probe passes, making that process very quick. As it stands, however, the 10s timer causes the same friction as above, leading to highly variable startup times.
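
Again purely as a sketch (not the generated Knative pod spec, with placeholder paths and ports): the HEAD-era shape is roughly a tight startup probe gating a readiness probe that still only ticks every 10 seconds, so on kubelets older than 1.21 the pod can sit Not Ready for up to one readiness period after the startup probe passes.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Illustrative only: a tight startup probe plus a slow readiness probe.
var container = corev1.Container{
	Name:  "user-container",
	Image: "example.dev/app:latest", // placeholder image
	// Startup probe on a ~1s period; it gates the other probes.
	StartupProbe: &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8012)}, // placeholders
		},
		PeriodSeconds:    1,
		FailureThreshold: 120,
	},
	// Readiness probe still on a 10s period: before the K8s 1.21 behaviour,
	// it is not re-run immediately when the startup probe passes, so the pod
	// may wait up to 10s before being marked Ready.
	ReadinessProbe: &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8012)},
		},
		PeriodSeconds: 10,
	},
}
```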

A few data points to illustrate:

HEAD

Created 10 knative-head pods sequentially, results are in ms:
min: 1873, max: 11434, mean: 6356, p95: 10646, p99: 10646

HEAD (reducing the readiness probe period to 1s)

Created 10 knative-head pods sequentially, results are in ms:
min: 1323, max: 2645, mean: 2001, p95: 2603, p99: 2603

So... we should do something here! We definitely need E2E test coverage for cases like this, so we don't regress as badly in the future. Since #8147 is no longer an issue, I think we can safely reduce the period of our readiness probe (which is now an HTTP probe) to 1s to at least get us back down to more sensible and stable levels.
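
As a rough sketch of what such a measurement (and eventually an E2E regression test) could look like, assuming client-go and hypothetical names throughout; this is not the podspeed tool's actual code:

```go
package sketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// timeToReady creates the given pod and returns how long it takes for its
// Ready condition to become True. Illustrative only.
func timeToReady(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) (time.Duration, error) {
	start := time.Now()
	created, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		return 0, err
	}
	w, err := client.CoreV1().Pods(pod.Namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + created.Name,
	})
	if err != nil {
		return 0, err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		p, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return time.Since(start), nil
			}
		}
	}
	return 0, fmt.Errorf("watch closed before pod %s became ready", created.Name)
}
```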

markusthoemmes added the kind/bug label on Mar 18, 2021.
@markusthoemmes (Contributor, Author):

/assign

@julz (Member) commented on Mar 18, 2021:

Added a rough sketch of what may be an additional workaround for this (amongst other things) in #10978.

@evankanderson (Member):

/area autoscale
/triage assigned

@knative-prow-robot (Contributor):

@evankanderson: The label(s) triage/assigned cannot be applied, because the repository doesn't have them.

In response to this:

/area autoscale
/triage assigned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@evankanderson (Member):

/triage accepted

knative-prow-robot added the triage/accepted label on Mar 19, 2021.
@dprotaso (Member):

We should separate 'cold start' from Pod 'Readiness', as the activator doesn't wait for the Pod to be 'Ready'; it has its own 200ms polling tight loop.
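
For context, a minimal sketch (not the actual activator code) of the kind of ~200ms tight polling loop meant here, which can send traffic to a pod before its Ready condition flips:

```go
package sketch

import (
	"context"
	"net/http"
	"time"
)

// waitForServing polls an endpoint every 200ms until it answers 200 OK,
// independently of the Pod's Ready condition. Illustrative only; the real
// activator's probing is more involved.
func waitForServing(ctx context.Context, url string) error {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			resp, err := http.Get(url)
			if err != nil {
				continue // not up yet; try again on the next tick
			}
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // usable for cold-start traffic even if not yet Ready
			}
		}
	}
}
```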

Is there a follow-up issue that captures the changes we should make when we bump to K8s 1.21?

@markusthoemmes (Contributor, Author):

I agree in principle, @dprotaso. Both measures are important for different circumstances. There's no follow-up issue (yet), but the code currently says:

// TODO(#10973): Remove this once we're on K8s 1.21

Do we still want to open a tracker?

@dprotaso (Member):

Made the issue - should be good :)
