Only use exec probe at startup time #10741

Merged
merged 4 commits into knative:master on Mar 2, 2021

Conversation

@julz (Member) commented Feb 10, 2021

Fixes #10703.

Now that StartupProbe is available, we can avoid spawning an exec probe in to the user pod every 10 seconds \o/. This maintains the same probe as now at startup-time, then after that uses a HTTP readiness probe that hits the same QP endpoint the exec probe hits. This minimises the changes from today and is a nice, backwards-compatible start.

Following on from this I think we may want to rework quite a bit of how our readiness probe stuff works (e.g. it'd be nice to keep the probes on the user container so failures are on the right object, and we currently ignore probes ~entirely after startup if periodSeconds>0), but this is a minimal change that should be entirely backwards-compatible and save quite a few cpu cycles.
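For concreteness, here's a minimal sketch of the probe shape this produces, assuming the corev1 API of the era; the function name, port argument, and exact field values are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// buildQueueProxyProbes sketches the approach described above: an aggressive
// exec StartupProbe (the queue binary re-polls internally every 25ms, far
// faster than kubelet's 1s floor), then a plain HTTP ReadinessProbe against
// the same queue-proxy endpoint for the rest of the pod's life.
func buildQueueProxyProbes(progressDeadline time.Duration, qpPort int) (startup, readiness *corev1.Probe) {
	startup = &corev1.Probe{
		Handler: corev1.Handler{
			Exec: &corev1.ExecAction{
				// One exec attempt whose internal retry loop runs until the
				// deadline; no exec process is spawned once startup succeeds.
				Command: []string{"/ko-app/queue", "-probe-period", progressDeadline.String()},
			},
		},
		FailureThreshold: 1,
		TimeoutSeconds:   int32(progressDeadline.Seconds()),
	}
	readiness = &corev1.Probe{
		Handler: corev1.Handler{
			// Assumed: the queue-proxy port serving the health endpoint.
			HTTPGet: &corev1.HTTPGetAction{Port: intstr.FromInt(qpPort)},
		},
	}
	return startup, readiness
}

func main() {
	s, r := buildQueueProxyProbes(10*time.Minute, 8012)
	fmt.Println(s.TimeoutSeconds, r.HTTPGet.Port.IntValue())
}
```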

knative-prow-robot added the do-not-merge/work-in-progress label Feb 10, 2021
google-cla bot added the cla: yes label Feb 10, 2021
knative-prow-robot added the size/L label Feb 10, 2021
knative-prow-robot added the area/API label Feb 10, 2021
@vagababov (Contributor)

🎉

codecov bot commented Feb 10, 2021

Codecov Report

Merging #10741 (7bbb968) into master (1e070a3) will decrease coverage by 0.14%.
The diff coverage is 90.86%.


@@            Coverage Diff             @@
##           master   #10741      +/-   ##
==========================================
- Coverage   88.16%   88.02%   -0.15%     
==========================================
  Files         186      188       +2     
  Lines        8892     9100     +208     
==========================================
+ Hits         7840     8010     +170     
- Misses        812      834      +22     
- Partials      240      256      +16     
Impacted Files Coverage Δ
cmd/activator/main.go 0.00% <0.00%> (ø)
cmd/autoscaler/main.go 11.92% <0.00%> (ø)
cmd/queue/execprobe.go 95.34% <ø> (-0.21%) ⬇️
cmd/queue/main.go 0.51% <0.00%> (ø)
pkg/activator/handler/context.go 100.00% <ø> (ø)
pkg/activator/net/throttler.go 88.85% <ø> (-0.04%) ⬇️
pkg/apis/serving/v1/revision_types.go 100.00% <ø> (ø)
pkg/reconciler/autoscaling/hpa/hpa.go 90.62% <ø> (ø)
pkg/reconciler/autoscaling/kpa/kpa.go 94.64% <ø> (ø)
pkg/reconciler/autoscaling/resources/sks.go 100.00% <ø> (ø)
... and 66 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e070a3...c9de290. Read the comment docs.

@julz (Member Author) commented Feb 10, 2021

/test pull-knative-serving-upgrade-tests

@julz (Member Author) commented Feb 10, 2021

well, tests all pass soooo.. this might work!

julz force-pushed the startupstartupprobe branch from 8987b7c to 793f0f9 on February 12, 2021 15:05
Now that StartupProbe is available, we can avoid spawning the exec
probe other than at startup time. For requests after startup this
directly uses the same endpoint as the exec probe in the QP as the
target of an HTTP readiness probe.

Following on from this I think we may want to rework quite a bit of how
our readiness probe stuff works (e.g. it'd be nice to keep the probes on
the user container so failures are on the right object, and we currently
ignore probes ~entirely after startup if periodSeconds>0), but this is a
minimal change that should be entirely backwards-compatible and saves
quite a few cpu cycles.
julz force-pushed the startupstartupprobe branch from 793f0f9 to 1c03c42 on February 12, 2021 17:17
@julz (Member Author) commented Feb 12, 2021

/retest

julz changed the title from "WIP: Only use exec probe at startup time" to "Only use exec probe at startup time" Feb 12, 2021
knative-prow-robot removed the do-not-merge/work-in-progress label Feb 12, 2021
@julz (Member Author) commented Feb 12, 2021

Think this is good to go if you wanna take a look

/assign @vagababov @mattmoor

@vagababov (Contributor) left a comment

/lgtm
/hold
for Markus and/or Matt for a second glance

knative-prow-robot added the do-not-merge/hold and lgtm labels Feb 12, 2021
@mattmoor (Member)

We're late in the cycle for this, let's land it early in 0.22 and have it bake a while.

@julz (Member Author) commented Feb 24, 2021

Release shipped, @mattmoor wanna take one last look through before we unhold or should we get this baking? :)

@mattmoor (Member)

/unhold

knative-prow-robot removed the do-not-merge/hold label Feb 24, 2021
Comment on lines 260 to 267
// During startup we want to poll the container faster than Kubernetes will
// allow, so we use an ExecProbe which starts immediately and then polls
// every 25ms. We encode the original probe as JSON in an environment
// variable for this probe to use.
userProbe := container.ReadinessProbe.DeepCopy()
applyReadinessProbeDefaultsForExec(userProbe, userPort)
execProbe := makeExecProbe(userProbe)
userProbeJSON, err := readiness.EncodeProbe(userProbe)
Member

I assume that using the readiness probe as the startup probe is consistent with K8s when the latter is unspecified? Have you thought at all about letting users specify startupProbe?

Member Author

yah, as I understand it startupProbe just pauses the other probes until it succeeds (so we avoid accidentally probing before the pod is ready and ending up penalized a full periodSeconds until we can try again). It doesn't have any "extra" semantics we'll be accidentally importing.

Re: user startupProbe, yes I am! Will open an issue in a minute. They're less important, I think, for our use case, because generally they're used for slow-starting containers, but there's no real reason not to let users use them (especially since they should be wholly orthogonal if applied to the user container rather than the QP).
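(Purely illustrative, not something this PR adds: a user-specified startup probe would presumably sit on the user container roughly like this, with all values made up.)

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Hypothetical only: kubelet suspends readiness and liveness probing until
// the startup probe succeeds, which is exactly the "pausing" described above.
func exampleUserStartupProbe() *corev1.Probe {
	return &corev1.Probe{
		Handler: corev1.Handler{
			// Made-up path and port, purely for illustration.
			HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
		},
		PeriodSeconds:    10,
		FailureThreshold: 30, // tolerate ~5 minutes of slow start (30 x 10s)
	}
}

func main() { _ = exampleUserStartupProbe() }
```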

@mattmoor (Member)

You don't have to block on me, I just have a question. Thanks for driving this, and for your patience waiting on startupProbe to be available 😉

@julz (Member Author) commented Feb 25, 2021

looks like this still needs an approve from someone 🙏 @mattmoor @vagababov @dprotaso

julz force-pushed the startupstartupprobe branch from 264d2e9 to 7f94f2d on February 26, 2021 11:33
@julz (Member Author) commented Feb 26, 2021

Think this might be legit, but let's double check it's not a flake before digging in:

/test pull-knative-serving-upgrade-tests

@julz (Member Author) commented Feb 26, 2021

Yeah that was the problem, guess I should've tidied up the flag name in a separate PR anyway (we were always treating it as a timeout even though the flag was -period). Seems to be passing prow now.

@julz (Member Author) commented Feb 26, 2021

ok @dprotaso @mattmoor @vagababov wanna take a look at the updated version? tl;dr: this now only uses the exec probe for aggressive probes (if periodSeconds > 1 we just use a regular readiness probe and no startup probe), and it sets failureThreshold=1 and timeout=progressDeadline so we'll wait as long as we would on initial deployment before giving up.
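As a sketch of that decision rule (helper name assumed, not the PR's actual code):

```go
package main

import corev1 "k8s.io/api/core/v1"

// Probes with periodSeconds of 0 (Knative's aggressive default) or 1 keep the
// fast exec startup probe; anything slower is handed to kubelet as an
// ordinary readiness probe, so no exec process is spawned at all.
func wantsAggressiveExecProbe(readinessProbe *corev1.Probe) bool {
	return readinessProbe.PeriodSeconds <= 1
}

func main() { _ = wantsAggressiveExecProbe }
```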

@knative-prow-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dprotaso (Member) commented Feb 26, 2021

Yeah that was the problem, guess I should've tidied up the flag name in a separate PR anyway (we were always treating it as a timeout even though the flag was -period). Seems to be passing prow now.

This is surprising - I thought this change would be ok given setting the flag would be paired with the new image sha - what was the exact error?

@julz (Member Author) commented Feb 26, 2021

yeah, I'm not sure, that's why I rolled the dice again, but it failed both times before the change (all the initial services failed to go ready)... then again, now I'm looking at prow history there's a lot of red there separately :/. In theory I think the new code could reconcile with the old QP because it may not have seen the configmap update with the new QP image straight away, but yeah, that does seem like a pretty small window 🤔.. do the updates do anything funky in how they upgrade the configmaps?

@julz (Member Author) commented Mar 2, 2021

anyone want to take a last look or are we ready to unhold and get this baking? @mattmoor @vagababov (@dprotaso already approved)

@mattmoor (Member) commented Mar 2, 2021

Don't block on me :)

@vagababov (Contributor) left a comment

just nits
/lgtm

  Handler: corev1.Handler{
  	Exec: &corev1.ExecAction{
- 		Command: []string{"/ko-app/queue", "-probe-period", timeout.String()},
+ 		Command: []string{"/ko-app/queue", "-probe-period", progressDeadline.String()},
@vagababov (Contributor)

Probably move the comment explaining why we set this equal to the ProgressDeadline?

@julz (Member Author)

👍

defaults, _ = apicfg.NewDefaultsConfigFromMap(nil)
revCfg = config.Config{
deploymentConfig = deployment.Config{
ProgressDeadline: 5678 * time.Second,
@vagababov (Contributor)

I envision a named constant in the future of this file? :)

@julz (Member Author)

heh, yeah, it's a bit of a pain tho because we need 5678 * time.Second here and some other places, but plain ol' 5678 in the TimeoutSeconds fields (because it's not a duration), and "1h34m38s" in the -probe-period flag, so it's either 3 constants or the tests have to duplicate the code a bit :(. I vote we clean it up in a follow-up :D
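For illustration, the three shapes the same value has to take (a runnable sketch of the annoyance being described):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const progressDeadlineSeconds = 5678              // plain number, for TimeoutSeconds fields
	deadline := progressDeadlineSeconds * time.Second // time.Duration, for Go-side config
	fmt.Println(deadline.String())                    // "1h34m38s", for the -probe-period flag
}
```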

knative-prow-robot added the lgtm label Mar 2, 2021
knative-prow-robot removed the lgtm label Mar 2, 2021
@julz (Member Author) commented Mar 2, 2021

/unhold

next lgtm from @vagababov should get this going now!

knative-prow-robot removed the do-not-merge/hold label Mar 2, 2021
@vagababov (Contributor) left a comment

/lgtm

knative-prow-robot added the lgtm label Mar 2, 2021
@julz (Member Author) commented Mar 2, 2021

🎉

@julz (Member Author) commented Mar 2, 2021

Looks like a flake: #10871 (not wild about how often I'm typing retest these days, tho)

/retest

knative-prow-robot merged commit e9ea07e into knative:master Mar 2, 2021
markusthoemmes pushed a commit to markusthoemmes/knative-serving that referenced this pull request Mar 23, 2021
* Only use exec probe at startup time

Now that StartupProbe is available, we can avoid spawning the exec
probe other than at startup time. For requests after startup this
directly uses the same endpoint as the exec probe in the QP as the
target of an HTTP readiness probe.

Following on from this I think we may want to rework quite a bit of how
our readiness probe stuff works (e.g. it'd be nice to keep the probes on
the user container so failures are on the right object, and we currently
ignore probes ~entirely after startup if periodSeconds>0), but this is a
minimal change that should be entirely backwards-compatible and saves
quite a few cpu cycles.

* Use ProgressDeadline as failure timeout for startup probe

- Also just drop exec probe entirely for periodSeconds > 1 since these
  can just use the readiness probe now. (Easier than figuring out how to
  do square ProgressDeadline with a custom period).

* See if flag is what's making upgrades unhappy

* reorganize comments
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this pull request Mar 24, 2021
…ency (#703)

This carries the commits above, plus:

* Default PeriodSeconds of the readiness probe to 1 if unset (knative#10992)

* Avoid deleting a stat if a request raced the reporter (knative#10729) (knative#10748)

This prevents a race between the report routine and requests flowing in and out. Since we're trying to minimize contention on the request path, the locking routines grab as few write-locks as possible, to allow things to progress in parallel.

That breaks, though, if a report reports AverageConcurrency == 0 and hence marks the stat for deletion. If a new request arrives between that decision and the entry actually being deleted (two separate locks, as we only grab a read lock to determine the deletion), it grabs the stat that is about to be deleted, which is then not seen by the next report routine.

The In event is lost and the stat's concurrency becomes negative, unrecoverably.

* Avoid pointer

Co-authored-by: Julian Friedman <julz.friedman@uk.ibm.com>
markusthoemmes added the same commits to markusthoemmes/knative-serving in three further pushes that referenced this pull request Apr 7, 2021.
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this pull request Apr 7, 2021

* Fix race condition with Prober logger in upgrade tests (#670)

Fixes knative/pkg#2026

The actual issue is that the test context expires between individual stages run by the upgrade framework. This fix passes an external logger that survives the stages.

* Only use exec probe at startup time (knative#10741), as above

* Default PeriodSeconds of the readiness probe to 1 if unset (knative#10992)

Co-authored-by: Martin Gencur <mgencur@redhat.com>
Co-authored-by: Julian Friedman <julz.friedman@uk.ibm.com>
Labels
approved · area/API · area/autoscale · area/networking · cla: yes · lgtm · size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use StartupProbe in Queue Proxy
5 participants