Pod-startup regression due to probing periods #10973

Closed
markusthoemmes opened this issue Mar 18, 2021 · 8 comments · Fixed by #10992

Labels: area/autoscale · kind/bug (Categorizes issue or PR as related to a bug) · triage/accepted (Issues which should be fixed, post-triage)

@markusthoemmes (Contributor):

I've been doing some investigation into pod startup times, since that's something the autoscaling WG wants to optimize, and that made me stumble into wildly varying pod-startup time data. (Note: all data was captured using https://github.com/markusthoemmes/podspeed, so I might totally have a bug there.)

v0.21 and prior

Our assumption was that the exec readiness probe would only start running after the container had been started. That assumption is wrong. Generally, the probes run on their own timer, asynchronously to pod startup (until recently, but more on that later). The readiness probe's period was bumped to 10s in #8147. If the container is not yet started when the timer ticks first, it'll only tick again 10s later, making for... well... very slow startup times:
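
For illustration only (this is not the actual Knative-generated container spec): a minimal sketch of a container with an exec readiness probe on a 10-second period, using k8s.io/api/core/v1 types. The field names assume a k8s.io/api version (v0.22+) where the probe handler is embedded as `ProbeHandler`; the image and probe command are placeholders.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// Illustrative only: an exec readiness probe that the kubelet runs on its own
// 10s timer, independent of when the container actually starts. If the first
// tick fires before the container is up, the pod cannot become Ready until
// the next tick, a full PeriodSeconds later.
var userContainer = corev1.Container{
	Name:  "user-container",
	Image: "example.dev/app:latest", // placeholder image
	ReadinessProbe: &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"/probe"}, // placeholder probe binary
			},
		},
		PeriodSeconds: 10, // the period bumped in #8147
	},
}
```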

HEAD

@julz swapped us over to using a startup probe that runs on a tight-ish timer. That in theory fixes the above, but the readiness probe is still running on a 10s period. Even with the startup probe passing, the readiness probe is required to mark the container Ready. K8s v1.21 has a fix that kicks off the readiness probe as soon as the startup probe passes, making that process very quick. As it stands, however, the 10s timer causes the same friction as above, leading to highly variable startup times.
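
Again purely as a sketch (not the generated Knative pod spec, with placeholder paths and ports): the HEAD-era shape is roughly a tight startup probe gating a readiness probe that still only ticks every 10 seconds, so on kubelets older than 1.21 the pod can sit Not Ready for up to one readiness period after the startup probe passes.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Illustrative only: a tight startup probe plus a slow readiness probe.
var container = corev1.Container{
	Name:  "user-container",
	Image: "example.dev/app:latest", // placeholder image
	// Startup probe on a ~1s period; it gates the other probes.
	StartupProbe: &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8012)}, // placeholders
		},
		PeriodSeconds:    1,
		FailureThreshold: 120,
	},
	// Readiness probe still on a 10s period: before the K8s 1.21 behaviour,
	// it is not re-run immediately when the startup probe passes, so the pod
	// may wait up to 10s before being marked Ready.
	ReadinessProbe: &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8012)},
		},
		PeriodSeconds: 10,
	},
}
```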

A few data points to illustrate:

HEAD

Created 10 knative-head pods sequentially, results are in ms:
min: 1873, max: 11434, mean: 6356, p95: 10646, p99: 10646

HEAD (reducing the readiness probe period to 1s)

Created 10 knative-head pods sequentially, results are in ms:
min: 1323, max: 2645, mean: 2001, p95: 2603, p99: 2603

So... we should do something here! We definitely need E2E test coverage for cases like this, so we don't regress as badly in the future. Since #8147 is no longer an issue, I think we can safely reduce the period of our readiness probe (which is now an HTTP probe) to 1s to at least get us back down to more sensible and stable levels.
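
As a rough sketch of what such a measurement (and eventually an E2E regression test) could look like, assuming client-go and hypothetical names throughout; this is not the podspeed tool's actual code:

```go
package sketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// timeToReady creates the given pod and returns how long it takes for its
// Ready condition to become True. Illustrative only.
func timeToReady(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod) (time.Duration, error) {
	start := time.Now()
	created, err := client.CoreV1().Pods(pod.Namespace).Create(ctx, pod, metav1.CreateOptions{})
	if err != nil {
		return 0, err
	}
	w, err := client.CoreV1().Pods(pod.Namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + created.Name,
	})
	if err != nil {
		return 0, err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		p, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				return time.Since(start), nil
			}
		}
	}
	return 0, fmt.Errorf("watch closed before pod %s became ready", created.Name)
}
```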

markusthoemmes added the kind/bug label on Mar 18, 2021.
@markusthoemmes (Contributor, Author):

/assign

@julz (Member) commented on Mar 18, 2021:

Added a rough sketch of what may be an additional workaround for this (amongst other things) in #10978.

@evankanderson (Member):

/area autoscale
/triage assigned

@knative-prow-robot (Contributor):

@evankanderson: The label(s) triage/assigned cannot be applied, because the repository doesn't have them.

In response to this:

/area autoscale
/triage assigned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@evankanderson (Member):

/triage accepted

knative-prow-robot added the triage/accepted label on Mar 19, 2021.
@dprotaso (Member):

We should separate 'cold start' from Pod 'Readiness', as the activator doesn't wait for the Pod to be 'Ready'; it has its own 200ms polling tight loop.
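
For context, a minimal sketch (not the actual activator code) of the kind of ~200ms tight polling loop meant here, which can send traffic to a pod before its Ready condition flips:

```go
package sketch

import (
	"context"
	"net/http"
	"time"
)

// waitForServing polls an endpoint every 200ms until it answers 200 OK,
// independently of the Pod's Ready condition. Illustrative only; the real
// activator's probing is more involved.
func waitForServing(ctx context.Context, url string) error {
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			resp, err := http.Get(url)
			if err != nil {
				continue // not up yet; try again on the next tick
			}
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // usable for cold-start traffic even if not yet Ready
			}
		}
	}
}
```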

Is there a follow-up issue that captures the changes we should make when we bump to K8s 1.21?

@markusthoemmes (Contributor, Author):

I agree in principle, @dprotaso. Both measures are important for different circumstances. There's no follow-up issue (yet), but the code currently says:

// TODO(#10973): Remove this once we're on K8s 1.21

Do we still want to open a tracker?

@dprotaso (Member):

Made the issue - should be good :)
