Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Would it be terrible if: we default to no readiness probe if the user doesnt ask for one #10978

Open
julz opened this issue Mar 18, 2021 · 19 comments
Labels
area/API API objects and controllers area/autoscale kind/enhancement triage/accepted Issues which should be fixed (post-triage)
Milestone

Comments

@julz
Copy link
Member

julz commented Mar 18, 2021

/area API
/area autoscale

Describe the feature

Today, if a user doesnt specify a readiness probe we introduce a default (tcp) probe for them. Since #10741 this is run via both a startup probe (using an intermediate exec probe in queue proxy), and as a readiness probe which runs after the startup probe succeeds. The reason we do this is to avoid flakes when scaling up, ensuring the pod is at least ready to accept tcp connections before routing to it.

Before StartupProbe was available, running this as a ReadinessProbe was really the only way to accomplish this (readiness was the only game in town even tho we only really, I think, cared about startup flakes). However, now that we have StartupProbe it might be possible to only run the tcp probe for the startupprobe, and then not run any readiness probe at all. This is technically a behaviour change since we would currently detect the pod going unready in this way after startup, but it (a) is a closer match to what the user actually asked for -- they didnt specify a probe at all, and wouldnt get one in raw k8s (b) would be a partial workaround for the issue in #10973 since we would avoid potentially missing the window to run the readiness probe after the startup probe completes (c) is simpler and more efficient (especially since, due to (b) we need the probe period to be as short as possible..). We also (d) might be able to do even better and do the startup probe in a PostStart lifecycle hook rather than an actual StartupProbe which might fix #10973 entirely (while still avoiding flakes on scale-up).

Note: this is a rough thought for discussion. It may (well) have a fatal flaw I haven't spotted yet.

@julz julz added the kind/feature Well-understood/specified features, ready for coding. label Mar 18, 2021
@knative-prow-robot knative-prow-robot added area/API API objects and controllers area/autoscale labels Mar 18, 2021
@markusthoemmes
Copy link
Contributor

FWIW, I toyed a bit with a PostStart hook executing our binary and that seems to perform in the same ballpark as containers without probes at all.

@markusthoemmes
Copy link
Contributor

This has some great content on the tradeoffs we'd buy into when using lifecycle hooks: https://livebook.manning.com/book/kubernetes-in-action/chapter-17/64.

In a nutshell: Error reporting seems worse, so we prolly need to adjust the way we report errors from pods and the insight into why stuff fails seems to not be great. It'd also be interesting to doublecheck if the pod's IP is appearing as NonReadyAddress before the lifecycle runs, since it's going into a Pending state rather a NotReady state. I haven't checked on these pieces yet at all.

@markusthoemmes
Copy link
Contributor

Tripleposting 😂

I did investigate the behavior of the postStart hook briefly, and sadly, I was right above: The pod isn't getting any IP while the post-start hook is running, so that'd destroy other optimizations that we have in the activator to completely bypass the readiness signals and just probe the container on our own and send traffic to it ASAP (without waiting for signals to propagate through the API for instance).

@evankanderson
Copy link
Member

It sounds like this doesn't work for right now?

/close

(Feel free to reopen and add more notes if this should stay open)

@knative-prow-robot
Copy link
Contributor

@evankanderson: Closing this issue.

In response to this:

It sounds like this doesn't work for right now?

/close

(Feel free to reopen and add more notes if this should stay open)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@julz
Copy link
Member Author

julz commented Mar 19, 2021

/reopen

The bit that doesn't work is the postStart idea, which was a little side comment, not the main point of the issue (the thread's a bit confusing I guess since we side tracked ourselves a bit on a side issue). Tldr I still think it might make sense to avoid the unnecessary readiness check (every second x every pod) that the user didn't ask for since we only really need a startup check to do what we want (avoid flakes on scale up).

@knative-prow-robot
Copy link
Contributor

@julz: Reopened this issue.

In response to this:

/reopen

The bit that doesn't work is the postStart idea, which was a little side comment, not the main point of the issue (the thread's a bit confusing I guess since we side tracked ourselves a bit on a side issue). Tldr I still think it might make sense to avoid the unnecessary readiness check (every second x every pod) that the user didn't ask for since we only really need a startup check to do what we want (avoid flakes on scale up).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@evankanderson
Copy link
Member

/kind enhancement
/remove-kind feature

Does this need a proposal before implementation?

/triage assigned

@knative-prow-robot
Copy link
Contributor

@evankanderson: The label(s) triage/assigned cannot be applied, because the repository doesn't have them.

In response to this:

/kind enhancement
/remove-kind feature

Does this need a proposal before implementation?

/triage assigned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot removed the kind/feature Well-understood/specified features, ready for coding. label Mar 22, 2021
@evankanderson
Copy link
Member

/triage accepted

@knative-prow-robot knative-prow-robot added the triage/accepted Issues which should be fixed (post-triage) label Mar 22, 2021
@github-actions
Copy link

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 20, 2021
@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 2, 2021
@psschwei
Copy link
Contributor

psschwei commented Nov 1, 2021

Not defaulting in a readiness probe would be useful for the container-freezer, as a readiness probe on a frozen container isn't likely to be very useful (and I'm pretty sure it doesn't make sense to wake up a frozen container every second for a probe :) )

@evankanderson
Copy link
Member

I don't have an intrinsic concern with this on second thought (my first thought was "oop, don't do that"). It does seem like the container-freezer might want to run the startup probe on un-freeze, and would still need to be able to handle containers with an explicit readiness check.

@psschwei
Copy link
Contributor

Copying @julz 's comment on potentially swapping out the readiness probe for a startup probe: #12281 (comment)

@psschwei
Copy link
Contributor

psschwei commented Feb 4, 2022

/assign

@xtaje
Copy link

xtaje commented Jul 14, 2022

... optimizations that we have in the activator to completely bypass the readiness signals and just probe the container on our own and send traffic to it ASAP (without waiting for signals to propagate through the API for instance).

Is there any documentation on this behavior or reasons behind it? I found it quite surprising and only figured it out after debugging why a Pod revision was not switching over to ready, despite the service container reporting as ready. Our service needs several minutes during startup to load a large data model, and the readinessProbe parameters we had chosen were causing funny behavior.

For the use case I have, the extra latency of checking the container status via API would not be a problem. It seems that asking the API for ready state would allow you to use exec probes too, and not change the behavior of the container probe config also.

@psschwei
Copy link
Contributor

It's a bit of a long story (most of which was before my time, but there's a number of issues from early 2021 (?) that provide more details), but the short version is: containers are often "ready" before the API shows they are Ready, so checking the container directly ends up being faster.

@dprotaso dprotaso added this to the Icebox milestone Aug 2, 2022
@dprotaso
Copy link
Member

dprotaso commented Aug 2, 2022

Not defaulting in a readiness probe would be useful for the container-freezer,

FWIW - If I was a user (using container-freezer) and I configure a readiness probe I would expect it to probe while the container is active. But I would expect it to be paused when the container is paused.

I'm unsure if that's possible in K8s right now

@psschwei
Copy link
Contributor

psschwei commented Aug 2, 2022

Yeah, the case where a user specifically added a probe would be.... tricky, to say the least. I think the order of doing probe updates would go something like:

  • add default startup probe (Support startupProbe in webhook #10037 )
  • remove default readiness probe (this one)
  • figure out how the heck to handle user readiness probes if freezer is enabled (no issue yet)

@psschwei psschwei removed their assignment Oct 18, 2022
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
area/API API objects and controllers area/autoscale kind/enhancement triage/accepted Issues which should be fixed (post-triage)
Projects
None yet
Development

No branches or pull requests

7 participants