Only use exec probe at startup time #10741

Merged
merged 4 commits into knative:master on Mar 2, 2021

Conversation

@julz (Member) commented Feb 10, 2021

Fixes #10703.

Now that StartupProbe is available, we can avoid spawning an exec probe in to the user pod every 10 seconds \o/. This maintains the same probe as now at startup-time, then after that uses a HTTP readiness probe that hits the same QP endpoint the exec probe hits. This minimises the changes from today and is a nice, backwards-compatible start.

Following on from this I think we may want to rework quite a bit of how our readiness probe stuff works (e.g. it'd be nice to keep the probes on the user container so failures are on the right object, and we currently ignore probes ~entirely after startup if periodSeconds>0), but this is a minimal change that should be entirely backwards-compatible and save quite a few cpu cycles.
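For concreteness, here's a minimal sketch of the probe shape this produces, assuming the corev1 API of the era; the function name, port argument, and exact field values are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// buildQueueProxyProbes sketches the approach described above: an aggressive
// exec StartupProbe (the queue binary re-polls internally every 25ms, far
// faster than kubelet's 1s floor), then a plain HTTP ReadinessProbe against
// the same queue-proxy endpoint for the rest of the pod's life.
func buildQueueProxyProbes(progressDeadline time.Duration, qpPort int) (startup, readiness *corev1.Probe) {
	startup = &corev1.Probe{
		Handler: corev1.Handler{
			Exec: &corev1.ExecAction{
				// One exec attempt whose internal retry loop runs until the
				// deadline; no exec process is spawned once startup succeeds.
				Command: []string{"/ko-app/queue", "-probe-period", progressDeadline.String()},
			},
		},
		FailureThreshold: 1,
		TimeoutSeconds:   int32(progressDeadline.Seconds()),
	}
	readiness = &corev1.Probe{
		Handler: corev1.Handler{
			// Assumed: the queue-proxy port serving the health endpoint.
			HTTPGet: &corev1.HTTPGetAction{Port: intstr.FromInt(qpPort)},
		},
	}
	return startup, readiness
}

func main() {
	s, r := buildQueueProxyProbes(10*time.Minute, 8012)
	fmt.Println(s.TimeoutSeconds, r.HTTPGet.Port.IntValue())
}
```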

knative-prow-robot added the do-not-merge/work-in-progress label Feb 10, 2021
google-cla bot added the cla: yes label Feb 10, 2021
knative-prow-robot added the size/L label Feb 10, 2021
knative-prow-robot added the area/API label Feb 10, 2021
@vagababov (Contributor)

🎉

codecov bot commented Feb 10, 2021

Codecov Report

Merging #10741 (7bbb968) into master (1e070a3) will decrease coverage by 0.14%.
The diff coverage is 90.86%.


@@            Coverage Diff             @@
##           master   #10741      +/-   ##
==========================================
- Coverage   88.16%   88.02%   -0.15%     
==========================================
  Files         186      188       +2     
  Lines        8892     9100     +208     
==========================================
+ Hits         7840     8010     +170     
- Misses        812      834      +22     
- Partials      240      256      +16     
Impacted Files Coverage Δ
cmd/activator/main.go 0.00% <0.00%> (ø)
cmd/autoscaler/main.go 11.92% <0.00%> (ø)
cmd/queue/execprobe.go 95.34% <ø> (-0.21%) ⬇️
cmd/queue/main.go 0.51% <0.00%> (ø)
pkg/activator/handler/context.go 100.00% <ø> (ø)
pkg/activator/net/throttler.go 88.85% <ø> (-0.04%) ⬇️
pkg/apis/serving/v1/revision_types.go 100.00% <ø> (ø)
pkg/reconciler/autoscaling/hpa/hpa.go 90.62% <ø> (ø)
pkg/reconciler/autoscaling/kpa/kpa.go 94.64% <ø> (ø)
pkg/reconciler/autoscaling/resources/sks.go 100.00% <ø> (ø)
... and 66 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1e070a3...c9de290. Read the comment docs.

@julz (Member Author) commented Feb 10, 2021

/test pull-knative-serving-upgrade-tests

@julz (Member Author) commented Feb 10, 2021

well, tests all pass soooo.. this might work!

julz force-pushed the startupstartupprobe branch from 8987b7c to 793f0f9 on February 12, 2021 15:05
Now that StartupProbe is available, we can avoid spawning the exec
probe other than at startup time. For requests after startup this
directly uses the same endpoint as the exec probe in the QP as the
target of an HTTP readiness probe.

Following on from this I think we may want to rework quite a bit of how
our readiness probe stuff works (e.g. it'd be nice to keep the probes on
the user container so failures are on the right object, and we currently
ignore probes ~entirely after startup if periodSeconds>0), but this is a
minimal change that should be entirely backwards-compatible and saves
quite a few cpu cycles.
julz force-pushed the startupstartupprobe branch from 793f0f9 to 1c03c42 on February 12, 2021 17:17
@julz (Member Author) commented Feb 12, 2021

/retest

julz changed the title from "WIP: Only use exec probe at startup time" to "Only use exec probe at startup time" Feb 12, 2021
knative-prow-robot removed the do-not-merge/work-in-progress label Feb 12, 2021
@julz (Member Author) commented Feb 12, 2021

Think this is good to go if you wanna take a look

/assign @vagababov @mattmoor

@vagababov (Contributor) left a comment

/lgtm
/hold
for Markus and/or Matt for a second glance

knative-prow-robot added the do-not-merge/hold and lgtm labels Feb 12, 2021
@mattmoor (Member)

We're late in the cycle for this, let's land it early in 0.22 and have it bake a while.

@julz (Member Author) commented Feb 24, 2021

Release shipped, @mattmoor wanna take one last look through before we unhold or should we get this baking? :)

@mattmoor (Member)

/unhold

knative-prow-robot removed the do-not-merge/hold label Feb 24, 2021
Comment on lines 260 to 267
// During startup we want to poll the container faster than Kubernetes will
// allow, so we use an ExecProbe which starts immediately and then polls
// every 25ms. We encode the original probe as JSON in an environment
// variable for this probe to use.
userProbe := container.ReadinessProbe.DeepCopy()
applyReadinessProbeDefaultsForExec(userProbe, userPort)
execProbe := makeExecProbe(userProbe)
userProbeJSON, err := readiness.EncodeProbe(userProbe)
Member

I assume that using the readiness probe as the startup probe is consistent with K8s when the latter is unspecified? Have you thought at all about letting users specify startupProbe?

Member Author

yah, as I understand it startupProbe just pauses the other probes until it succeeds (so we avoid accidentally probing before the pod is ready and ending up penalized a full periodSeconds until we can try again). It doesn't have any "extra" semantics we'll be accidentally importing.

Re: user startupProbe, yes I am! Will open an issue in a minute. They're less important, I think, for our use case, because generally they're used for slow-starting containers, but there's no real reason not to let users use them (especially since they should be wholly orthogonal if applied to the user container rather than the QP).
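(Purely illustrative, not something this PR adds: a user-specified startup probe would presumably sit on the user container roughly like this, with all values made up.)

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// Hypothetical only: kubelet suspends readiness and liveness probing until
// the startup probe succeeds, which is exactly the "pausing" described above.
func exampleUserStartupProbe() *corev1.Probe {
	return &corev1.Probe{
		Handler: corev1.Handler{
			// Made-up path and port, purely for illustration.
			HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
		},
		PeriodSeconds:    10,
		FailureThreshold: 30, // tolerate ~5 minutes of slow start (30 x 10s)
	}
}

func main() { _ = exampleUserStartupProbe() }
```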

@mattmoor (Member)

You don't have to block on me, I just have a question. Thanks for driving this, and for your patience waiting on startupProbe to be available 😉

@julz (Member Author) commented Feb 25, 2021

looks like this still needs an approve from someone 🙏 @mattmoor @vagababov @dprotaso

julz force-pushed the startupstartupprobe branch from 264d2e9 to 7f94f2d on February 26, 2021 11:33
@julz (Member Author) commented Feb 26, 2021

Think this might be legit, but let's double check it's not a flake before digging in:

/test pull-knative-serving-upgrade-tests

@julz (Member Author) commented Feb 26, 2021

Yeah that was the problem, guess I should've tidied up the flag name in a separate PR anyway (we were always treating it as a timeout even though the flag was -period). Seems to be passing prow now.

@julz (Member Author) commented Feb 26, 2021

ok @dprotaso @mattmoor @vagababov wanna take a look at the updated version? tl;dr: this now only uses the exec probe for aggressive probes (if periodSeconds > 1 we just use a regular readiness probe and no startup probe), and it sets failureThreshold=1 and timeout=progressDeadline so we'll wait as long as we would on initial deployment before giving up.
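As a sketch of that decision rule (helper name assumed, not the PR's actual code):

```go
package main

import corev1 "k8s.io/api/core/v1"

// Probes with periodSeconds of 0 (Knative's aggressive default) or 1 keep the
// fast exec startup probe; anything slower is handed to kubelet as an
// ordinary readiness probe, so no exec process is spawned at all.
func wantsAggressiveExecProbe(readinessProbe *corev1.Probe) bool {
	return readinessProbe.PeriodSeconds <= 1
}

func main() { _ = wantsAggressiveExecProbe }
```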

@knative-prow-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dprotaso (Member) commented Feb 26, 2021

Yeah that was the problem, guess I should've tidied up the flag name in a separate PR anyway (we were always treating it as a timeout even though the flag was -period). Seems to be passing prow now.

This is surprising - I thought this change would be ok given setting the flag would be paired with the new image sha - what was the exact error?

@julz (Member Author) commented Feb 26, 2021

yeah, I'm not sure, that's why I rolled the dice again, but it failed both times before the change (all the initial services failed to go ready)... then again, now I'm looking at prow history there's a lot of red there separately :/. In theory I think the new code could reconcile with the old QP because it may not have seen the configmap update with the new QP image straight away, but yeah, that does seem like a pretty small window 🤔.. do the updates do anything funky in how they upgrade the configmaps?

@julz (Member Author) commented Mar 2, 2021

anyone want to take a last look or are we ready to unhold and get this baking? @mattmoor @vagababov (@dprotaso already approved)

@mattmoor (Member) commented Mar 2, 2021

Don't block on me :)

@vagababov (Contributor) left a comment

just nits
/lgtm

  Handler: corev1.Handler{
  	Exec: &corev1.ExecAction{
- 		Command: []string{"/ko-app/queue", "-probe-period", timeout.String()},
+ 		Command: []string{"/ko-app/queue", "-probe-period", progressDeadline.String()},
@vagababov (Contributor)

Probably move the comment explaining why we set this equal to the ProgressDeadline?

@julz (Member Author)

👍

defaults, _ = apicfg.NewDefaultsConfigFromMap(nil)
revCfg = config.Config{
deploymentConfig = deployment.Config{
ProgressDeadline: 5678 * time.Second,
@vagababov (Contributor)

I envision a named constant in the future of this file? :)

@julz (Member Author)

heh, yeah, it's a bit of a pain tho because we need 5678 * time.Second here and some other places, but plain ol' 5678 in the TimeoutSeconds fields (because it's not a duration), and "1h34m38s" in the -probe-period flag, so it's either 3 constants or the tests have to duplicate the code a bit :(. I vote we clean it up in a follow-up :D
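For illustration, the three shapes the same value has to take (a runnable sketch of the annoyance being described):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const progressDeadlineSeconds = 5678              // plain number, for TimeoutSeconds fields
	deadline := progressDeadlineSeconds * time.Second // time.Duration, for Go-side config
	fmt.Println(deadline.String())                    // "1h34m38s", for the -probe-period flag
}
```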

knative-prow-robot added the lgtm label Mar 2, 2021
knative-prow-robot removed the lgtm label Mar 2, 2021
@julz (Member Author) commented Mar 2, 2021

/unhold

next lgtm from @vagababov should get this going now!

knative-prow-robot removed the do-not-merge/hold label Mar 2, 2021
@vagababov (Contributor) left a comment

/lgtm

knative-prow-robot added the lgtm label Mar 2, 2021
@julz (Member Author) commented Mar 2, 2021

🎉

@julz (Member Author) commented Mar 2, 2021

Looks like a flake: #10871 (not wild about how often I'm typing retest these days, tho)

/retest

knative-prow-robot merged commit e9ea07e into knative:master Mar 2, 2021
markusthoemmes pushed a commit to markusthoemmes/knative-serving that referenced this pull request Mar 23, 2021
* Only use exec probe at startup time

Now that StartupProbe is available, we can avoid spawning the exec
probe other than at startup time. For requests after startup this
directly uses the same endpoint as the exec probe in the QP as the
target of an HTTP readiness probe.

Following on from this I think we may want to rework quite a bit of how
our readiness probe stuff works (e.g. it'd be nice to keep the probes on
the user container so failures are on the right object, and we currently
ignore probes ~entirely after startup if periodSeconds>0), but this is a
minimal change that should be entirely backwards-compatible and saves
quite a few cpu cycles.

* Use ProgressDeadline as failure timeout for startup probe

- Also just drop exec probe entirely for periodSeconds > 1 since these
  can just use the readiness probe now. (Easier than figuring out how to
  do square ProgressDeadline with a custom period).

* See if flag is what's making upgrades unhappy

* reorganize comments
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this pull request Mar 24, 2021
…ency (#703)

This carries the commits above, plus:

* Default PeriodSeconds of the readiness probe to 1 if unset (knative#10992)

* Avoid deleting a stat if a request raced the reporter (knative#10729) (knative#10748)

This prevents a race between the report routine and requests flowing in and out. Since we're trying to minimize contention on the request path, the locking routines grab as few write-locks as possible, to allow things to progress in parallel.

That breaks, though, if a report reports AverageConcurrency == 0 and hence marks the stat for deletion. If a new request arrives between that decision and the entry actually being deleted (two separate locks, as we only grab a read lock to determine the deletion), it grabs the stat that is about to be deleted, which is then not seen by the next report routine.

The In event is lost and the stat's concurrency becomes negative, unrecoverably.

* Avoid pointer

Co-authored-by: Julian Friedman <julz.friedman@uk.ibm.com>
markusthoemmes added the same commits to markusthoemmes/knative-serving in three further pushes that referenced this pull request Apr 7, 2021.
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this pull request Apr 7, 2021

* Fix race condition with Prober logger in upgrade tests (#670)

Fixes knative/pkg#2026

The actual issue is that the test context expires between individual stages run by the upgrade framework. This fix passes an external logger that survives the stages.

* Only use exec probe at startup time (knative#10741), as above

* Default PeriodSeconds of the readiness probe to 1 if unset (knative#10992)

Co-authored-by: Martin Gencur <mgencur@redhat.com>
Co-authored-by: Julian Friedman <julz.friedman@uk.ibm.com>
Labels
approved · area/API · area/autoscale · area/networking · cla: yes · lgtm · size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use StartupProbe in Queue Proxy
5 participants