Only use exec probe at startup time #10741
Conversation
🎉
Codecov Report
@@ Coverage Diff @@
## master #10741 +/- ##
==========================================
- Coverage 88.16% 88.02% -0.15%
==========================================
Files 186 188 +2
Lines 8892 9100 +208
==========================================
+ Hits 7840 8010 +170
- Misses 812 834 +22
- Partials 240 256 +16
Continue to review full report at Codecov.
/test pull-knative-serving-upgrade-tests
well, tests all pass soooo.. this might work!
Force-pushed from 8987b7c to 793f0f9
Now that StartupProbe is available, we can avoid spawning the exec probe other than at startup time. For requests after startup, this directly uses the same endpoint the exec probe hits in the QP as the target of an HTTP readiness probe. Following on from this I think we may want to rework quite a bit of how our readiness probe stuff works (e.g. it'd be nice to keep the probes on the user container so failures are on the right object, and we currently ignore probes ~entirely after startup if periodSeconds>0), but this is a minimal change that should be entirely backwards-compatible and saves quite a few cpu cycles.
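For orientation, here is a minimal sketch (not the actual reconciler code) of the probe split described above. The queue-proxy port, the probe-period value, and the helper name are placeholders for illustration only:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// sketchQueueProxyProbes illustrates the split: the aggressive exec probe runs
// only as the StartupProbe, and after startup an ordinary HTTP readiness probe
// hits the same queue-proxy health endpoint. Port and period are placeholders.
func sketchQueueProxyProbes(probePeriod string, qpPort int) (startup, readiness *corev1.Probe) {
	startup = &corev1.Probe{
		Handler: corev1.Handler{
			Exec: &corev1.ExecAction{
				// The queue-proxy binary polls the user container every 25ms.
				Command: []string{"/ko-app/queue", "-probe-period", probePeriod},
			},
		},
	}
	readiness = &corev1.Probe{
		Handler: corev1.Handler{
			// Same health check as the exec probe, but as a plain HTTP GET.
			HTTPGet: &corev1.HTTPGetAction{Port: intstr.FromInt(qpPort)},
		},
	}
	return startup, readiness
}

func main() {
	s, r := sketchQueueProxyProbes("10m", 8012)
	fmt.Println(s.Exec.Command, r.HTTPGet.Port.IntValue())
}
```

Since Kubernetes holds readiness (and liveness) checks until the startup probe succeeds, the aggressive polling stays confined to startup and the cheap HTTP probe takes over from there.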
Force-pushed from 793f0f9 to 1c03c42
/retest
Think this is good to go if you wanna take a look. /assign @vagababov @mattmoor
/lgtm
/hold for Markus and/or Matt for a second glance
We're late in the cycle for this, let's land it early in 0.22 and have it bake a while.
Release shipped, @mattmoor wanna take one last look through before we unhold or should we get this baking? :)
/unhold
// During startup we want to poll the container faster than Kubernetes will
// allow, so we use an ExecProbe which starts immediately and then polls
// every 25ms. We encode the original probe as JSON in an environment
// variable for this probe to use.
userProbe := container.ReadinessProbe.DeepCopy()
applyReadinessProbeDefaultsForExec(userProbe, userPort)
execProbe := makeExecProbe(userProbe)
userProbeJSON, err := readiness.EncodeProbe(userProbe)
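To make the flow concrete, here is a hypothetical sketch of how these pieces might then be attached to the queue-proxy container. The helper name, argument names, and the env var name are assumptions for illustration; the real wiring lives elsewhere in the reconciler and uses the same corev1 types shown above:

```go
// wireQueueProxyProbes is illustrative only: the JSON-encoded user probe is
// handed to the queue-proxy through an env var so its exec/startup probe can
// replay it aggressively, while a plain HTTP probe takes over after startup.
func wireQueueProxyProbes(qp *corev1.Container, execProbe, httpReadiness *corev1.Probe, userProbeJSON string) {
	qp.Env = append(qp.Env, corev1.EnvVar{
		Name:  "SERVING_READINESS_PROBE", // assumed name for illustration
		Value: userProbeJSON,
	})
	qp.StartupProbe = execProbe       // 25ms polling, startup only
	qp.ReadinessProbe = httpReadiness // ordinary HTTP probe afterwards
}
```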
I assume that using the readiness probe as the startup probe is consistent with K8s when the latter is unspecified? Have you thought at all about letting users specify startupProbe?
yah, as I understand it startupProbe just pauses the other probes until it succeeds (so we avoid accidentally probing before the pod is ready and then paying a periodSeconds penalty until we can try again). It doesn't have any "extra" semantics we'll be accidentally importing.
Re: user startupProbe, yes I am! Will open an issue in a minute. They're less important, I think, for our use case because generally they're used for slow-starting containers, but there's no real reason not to let users use them (especially since they should be wholly orthogonal if applied to the user container rather than the QP).
You don't have to block on me, I just have a question. Thanks for driving this, and for your patience waiting on startupProbe to be available 😉
looks like this still needs an approval from someone 🙏 @mattmoor @vagababov @dprotaso
Force-pushed from 264d2e9 to 7f94f2d
Think this might be legit, but let's double check it's not a flake before digging in: /test pull-knative-serving-upgrade-tests
Yeah that was the problem, guess I should've tidied up the flag name in a separate PR anyway (we were always treating it as a timeout even though the flag was -period). Seems to be passing prow now.
ok @dprotaso @mattmoor @vagababov wanna take a look at the updated version? tl;dr this now only uses the exec probe for aggressive probes (if periodSeconds>1 we just use a regular readiness probe and no startup probe), and it sets failureThreshold=1 and timeout=ProgressDeadline so we'll wait as long as we would on initial deployment before giving up.
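A rough sketch of that decision, with hypothetical helper names (makeHTTPReadinessProbe is invented for illustration; the real branching lives in the revision reconciler and uses the same corev1 and time imports as earlier):

```go
// buildProbes is illustrative only: for aggressive probes (periodSeconds <= 1)
// keep the exec-based startup probe, bounded by ProgressDeadline; otherwise
// skip the startup probe entirely and rely on the plain readiness probe.
func buildProbes(userProbe *corev1.Probe, progressDeadline time.Duration) (startup, readiness *corev1.Probe) {
	readiness = makeHTTPReadinessProbe(userProbe) // hypothetical helper
	if userProbe.PeriodSeconds > 1 {
		return nil, readiness
	}
	startup = &corev1.Probe{
		Handler: corev1.Handler{
			Exec: &corev1.ExecAction{
				Command: []string{"/ko-app/queue", "-probe-period", progressDeadline.String()},
			},
		},
		// A single attempt that may run for the whole ProgressDeadline, so a
		// slow start gets as long here as it would on initial deployment.
		FailureThreshold: 1,
		TimeoutSeconds:   int32(progressDeadline.Seconds()),
	}
	return startup, readiness
}
```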
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dprotaso
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This is surprising - I thought this change would be ok given setting the flag would be paired with the new image sha - what was the exact error?
yeah I'm not sure, that's why I rolled the dice again, but it failed both times before the change (all the initial services failed to go ready)... then again, now I'm looking at prow history there's a lot of red there separately :/. In theory I think the new code could reconcile with the old QP because it may not have seen the configmap update with the new QP image straight away, but yeah, that does seem like a pretty small window 🤔.. do the updates do anything funky in how they upgrade the configmaps?
anyone want to take a last look or are we ready to unhold and get this baking? @mattmoor @vagababov (@dprotaso already approved)
Don't block on me :)
just nits
/lgtm
Handler: corev1.Handler{
	Exec: &corev1.ExecAction{
-		Command: []string{"/ko-app/queue", "-probe-period", timeout.String()},
+		Command: []string{"/ko-app/queue", "-probe-period", progressDeadline.String()},
Probably move the comment explaining why we set this equal to the ProgressDeadline?
👍
defaults, _ = apicfg.NewDefaultsConfigFromMap(nil)
revCfg = config.Config{
deploymentConfig = deployment.Config{
	ProgressDeadline: 5678 * time.Second,
I envision a named constant in the future of this file? :)
heh, yeah, it's a bit of a pain tho because we need 5678 * time.Second here and some other places, but plain ol' 5678 in the TimeoutSeconds fields (because it's not a duration), and "1h34m38s" in the -probe-period flag, so it's either 3 constants or the tests have to duplicate the code a bit :(. I vote we clean it up in a follow-up :D
/unhold
next lgtm from @vagababov should get this going now!
/lgtm
🎉
Looks like a flake: #10871 (not wild about how often I'm typing retest these days, tho) /retest
* Only use exec probe at startup time
Now that StartupProbe is available, we can avoid spawning the exec probe other than at startup time. For requests after startup this directly uses the same endpoint as the exec probe in the QP as the target of an HTTP readiness probe. Following on from this I think we may want to rework quite a bit of how our readiness probe stuff works (e.g. it'd be nice to keep the probes on the user container so failures are on the right object, and we currently ignore probes ~entirely after startup if periodSeconds>0), but this is a minimal change that should be entirely backwards-compatible and saves quite a few cpu cycles.
* Use ProgressDeadline as failure timeout for startup probe
  - Also just drop the exec probe entirely for periodSeconds > 1 since these can just use the readiness probe now. (Easier than figuring out how to square ProgressDeadline with a custom period.)
* See if flag is what's making upgrades unhappy
* reorganize comments
Fixes #10703.
Now that StartupProbe is available, we can avoid spawning an exec probe into the user pod every 10 seconds \o/. This maintains the same probe as now at startup time, then after that uses an HTTP readiness probe that hits the same QP endpoint the exec probe hits. This minimises the changes from today and is a nice, backwards-compatible start.
Following on from this I think we may want to rework quite a bit of how our readiness probe stuff works (e.g. it'd be nice to keep the probes on the user container so failures are on the right object, and we currently ignore probes ~entirely after startup if periodSeconds>0), but this is a minimal change that should be entirely backwards-compatible and save quite a few cpu cycles.