
Default PeriodSeconds of the readiness probe to 1 if unset #10992

Merged
merged 1 commit into knative:main on Mar 22, 2021

Conversation

markusthoemmes (Contributor)

Fixes #10973

Proposed Changes

#10973 has quite an extensive writeup of the whys.

Tl;dr: we might wait 10s before we first run the readiness probe and thus before we call the container ready. That's kinda long.
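
For illustration, here is a minimal Go sketch of the defaulting this PR describes. The function name and its placement are hypothetical, not the exact change in pkg/reconciler/revision/resources/queue.go; the point is only that an unset PeriodSeconds (0) would otherwise fall back to Kubernetes' 10s default.

```go
// Hypothetical sketch of the defaulting behavior, not the actual
// knative-serving code.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// defaultReadinessPeriod returns a copy of the probe with PeriodSeconds set
// to 1 when the user left it unset (0), so the kubelet's 10s default never
// kicks in and the first readiness check runs promptly.
func defaultReadinessPeriod(p *corev1.Probe) *corev1.Probe {
	out := p.DeepCopy()
	if out.PeriodSeconds == 0 {
		out.PeriodSeconds = 1
	}
	return out
}

func main() {
	probe := &corev1.Probe{} // PeriodSeconds left unset by the user
	fmt.Println(defaultReadinessPeriod(probe).PeriodSeconds) // prints 1
}
```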

Release Note

Fixed a regression where pod bringup could take 10s or more even though the container itself comes up quickly.

/assign @vagababov @julz

@knative-prow-robot knative-prow-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/API API objects and controllers labels Mar 19, 2021
@google-cla google-cla bot added the cla: yes Indicates the PR's author has signed the CLA. label Mar 19, 2021
codecov bot commented Mar 19, 2021

Codecov Report

Merging #10992 (fdff52f) into main (90de379) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #10992      +/-   ##
==========================================
+ Coverage   87.97%   87.99%   +0.02%     
==========================================
  Files         188      188              
  Lines        9104     9115      +11     
==========================================
+ Hits         8009     8021      +12     
- Misses        837      839       +2     
+ Partials      258      255       -3     
Impacted Files Coverage Δ
pkg/reconciler/revision/resources/queue.go 100.00% <100.00%> (ø)
pkg/reconciler/route/resources/ingress.go 92.50% <0.00%> (-3.53%) ⬇️
pkg/reconciler/route/route.go 76.43% <0.00%> (ø)
pkg/autoscaler/statforwarder/leases.go 78.41% <0.00%> (+1.43%) ⬆️
pkg/autoscaler/statforwarder/processor.go 94.44% <0.00%> (+5.55%) ⬆️
pkg/autoscaler/statforwarder/forwarder.go 96.29% <0.00%> (+5.55%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 90de379...fdff52f. Read the comment docs.

@julz (Member) left a comment

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2021
@vagababov (Contributor) left a comment

/lgtm

@evankanderson (Member) left a comment

/approve

It would be nice if we could change this based on the reported apiserver version, but waiting ~9 months seems okay given the amount of code that would be required.

@knative-prow-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evankanderson, julz

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 22, 2021
@knative-prow-robot knative-prow-robot merged commit 9ee92b1 into knative:main Mar 22, 2021
markusthoemmes added a commit to markusthoemmes/knative-serving that referenced this pull request Mar 23, 2021
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this pull request Mar 24, 2021
…ency (#703)

* Only use exec probe at startup time (knative#10741)

* Only use exec probe at startup time

Now that StartupProbe is available, we can avoid spawning the exec
probe other than at startup time. For requests after startup, this
directly uses the same endpoint in the QP that the exec probe hits as the
target of an HTTP readiness probe. (A rough sketch of this probe wiring
follows this commit listing.)

Following on from this I think we may want to rework quite a bit of how
our readiness probing works (e.g. it'd be nice to keep the probes on
the user container so failures land on the right object, and we currently
ignore probes ~entirely after startup if periodSeconds > 0), but this is a
minimal change that should be entirely backwards-compatible and saves
quite a few CPU cycles.

* Use ProgressDeadline as failure timeout for startup probe

- Also just drop the exec probe entirely for periodSeconds > 1, since these
  can just use the readiness probe now. (Easier than figuring out how to
  square ProgressDeadline with a custom period.)

* See if flag is what's making upgrades unhappy

* reorganize comments

* Default PeriodSeconds of the readiness probe to 1 if unset (knative#10992)

* Avoid deleting a stat if a request raced the reporter (knative#10729) (knative#10748)

* Avoid deleting a stat if a request raced the reporter

This prevents a race between the report routine and requests flowing in and out. Since we're trying to minimize contention on the request path, the locking routines grab as few write locks as possible, to allow things to progress in parallel.

That breaks, though, if a report sees AverageConcurrency == 0 and hence marks the stat for deletion. If a new request comes in between that decision and the entry actually being deleted (two separate lock acquisitions, since we only grab a read lock when determining what to delete), it grabs a stat that is about to be deleted and is therefore not seen by the next report run.

The In event is lost and the stat's concurrency becomes negative, unrecoverably. (A generic sketch of the re-check-before-delete pattern follows this commit listing.)

* Avoid pointer

Co-authored-by: Julian Friedman <julz.friedman@uk.ibm.com>
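
As a companion to the "Only use exec probe at startup time" message above, here is a rough Go sketch of that probe wiring: an exec startup probe allowed to fail for up to ProgressDeadline, plus an HTTP readiness probe against the queue-proxy for everything after startup. The function name, port, and exec command are made up for illustration; this is not the actual knative-serving code.

```go
// Illustrative sketch only; names, port and command are hypothetical.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// buildProbes returns an exec startup probe (only run until the container
// first becomes ready) and an HTTP readiness probe that hits the same
// queue-proxy endpoint for all later checks.
func buildProbes(progressDeadlineSeconds int32) (startup, readiness *corev1.Probe) {
	const periodSeconds = 1

	startup = &corev1.Probe{
		PeriodSeconds: periodSeconds,
		// Let startup take up to the progress deadline before the pod is
		// considered failed.
		FailureThreshold: progressDeadlineSeconds / periodSeconds,
	}
	// Placeholder command; the real probe binary and flags are elided here.
	startup.Exec = &corev1.ExecAction{Command: []string{"/path/to/probe"}}

	readiness = &corev1.Probe{PeriodSeconds: periodSeconds}
	// 8012 is a made-up port standing in for the queue-proxy's probe endpoint.
	readiness.HTTPGet = &corev1.HTTPGetAction{
		Path: "/",
		Port: intstr.FromInt(8012),
	}
	return startup, readiness
}

func main() {
	s, r := buildProbes(600) // e.g. a 10-minute progress deadline
	fmt.Println(s.FailureThreshold, r.HTTPGet.Port.IntValue())
}
```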
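
And for the "Avoid deleting a stat if a request raced the reporter" message, here is a generic sketch of one way to avoid that kind of lost update: collect deletion candidates under the read lock, then re-check each candidate under the write lock before deleting. This illustrates the pattern only; it is not the actual autoscaler code.

```go
// Generic illustration of a "re-check before delete" pattern; not the
// actual knative autoscaler implementation.
package main

import (
	"sync"
	"sync/atomic"
)

type stat struct{ concurrency int64 }

type store struct {
	mu    sync.RWMutex
	stats map[string]*stat
}

// in records an incoming request. A read lock is enough for the map lookup;
// the counter itself is atomic. (Creation of missing entries is elided.)
func (s *store) in(key string) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if st, ok := s.stats[key]; ok {
		atomic.AddInt64(&st.concurrency, 1)
	}
}

// report scans under the read lock for stats whose concurrency dropped to
// zero, then deletes them under the write lock, re-checking each candidate
// so an entry that a racing request just touched is never thrown away.
func (s *store) report() {
	var candidates []string
	s.mu.RLock()
	for key, st := range s.stats {
		if atomic.LoadInt64(&st.concurrency) == 0 {
			candidates = append(candidates, key)
		}
	}
	s.mu.RUnlock()

	s.mu.Lock()
	for _, key := range candidates {
		// A request may have raced in between the two lock acquisitions.
		if st, ok := s.stats[key]; ok && atomic.LoadInt64(&st.concurrency) == 0 {
			delete(s.stats, key)
		}
	}
	s.mu.Unlock()
}

func main() {
	s := &store{stats: map[string]*stat{"rev-1": {}}}
	s.in("rev-1")
	s.report() // "rev-1" survives because its concurrency is non-zero
}
```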
markusthoemmes added a commit to markusthoemmes/knative-serving that referenced this pull request Apr 7, 2021
markusthoemmes added a commit to markusthoemmes/knative-serving that referenced this pull request Apr 7, 2021
markusthoemmes added a commit to markusthoemmes/knative-serving that referenced this pull request Apr 7, 2021
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this pull request Apr 7, 2021
* Fix race condition with Prober logger in upgrade tests (#670)

Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages. (A rough sketch of the idea follows this commit listing.)

* Only use exec probe at startup time (knative#10741)

* Use ProgressDeadline as failure timeout for startup probe

* See if flag is what's making upgrades unhappy

* reorganize comments

* Default PeriodSeconds of the readiness probe to 1 if unset (knative#10992)

Co-authored-by: Martin Gencur <mgencur@redhat.com>
Co-authored-by: Julian Friedman <julz.friedman@uk.ibm.com>
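
The Prober-logger fix above boils down to building the logger once, outside any per-stage context, and handing it to every stage. Here is a minimal sketch of that shape using zap; the stage and runStages types are invented for illustration and are not the knative/pkg upgrade framework API.

```go
// Minimal sketch of "pass an external logger that survives the stages";
// the stage/runStages types are hypothetical, not the upgrade framework API.
package main

import (
	"context"
	"time"

	"go.uber.org/zap"
)

type stage func(ctx context.Context, log *zap.SugaredLogger)

// runStages gives every stage its own short-lived context but shares the
// long-lived logger created by the caller, so logging keeps working even
// after an earlier stage's context has been cancelled.
func runStages(log *zap.SugaredLogger, stages ...stage) {
	for i, s := range stages {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		log.Infof("running stage %d", i)
		s(ctx, log)
		cancel()
	}
}

func main() {
	logger, err := zap.NewDevelopment()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	runStages(logger.Sugar(),
		func(ctx context.Context, log *zap.SugaredLogger) { log.Info("pre-upgrade checks") },
		func(ctx context.Context, log *zap.SugaredLogger) { log.Info("post-upgrade checks") },
	)
}
```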
Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
area/API - API objects and controllers
cla: yes - Indicates the PR's author has signed the CLA.
lgtm - Indicates that a PR is ready to be merged.
size/S - Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Pod-startup regression due to probing periods
6 participants