
add cert auto-recovery for kube-apiserver and kube-controller-manager design #1

Merged (2 commits) on Oct 15, 2019

Conversation


@deads2k (Contributor) commented on Aug 22, 2019:

Fully automate the recovery of kube-apiserver and kube-controller-manager certificates currently documented [here](https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html).
Currently, there are helper commands to make the effort more practical, but we think we can fully automate the process
to avoid human error and intervention.

This will make integrations from teams like code-ready-containers and training easier to build and reduce our support burden in recovery cases.

@openshift/sig-master

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 22, 2019
@deads2k (Contributor, Author) commented on Sep 19, 2019:

/assign @derekwaynecarr @mfojtik

/cc @tnozicka

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 20, 2019
@tnozicka (Contributor) left a comment:

(half pass, will continue tomorrow, but it seems good so far)

1. kas-static-pod/kube-apiserver starts with expired certificates
2. kas-static-pod/cert-syncer connects to the localhost kube-apiserver using a long-lived SNI cert (localhost-recovery). It sees expired certs.
3. kas-static-pod/cert-regenerator connects to the localhost kube-apiserver using a long-lived SNI cert (localhost-recovery). It sees expired certs and refreshes them as appropriate. Being in the same repo, it uses the same logic. We will probably add an overall option to the library-go cert rotation to say, "only refresh on expired"
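
As an aside on step 3's "only refresh on expired" idea, here is a minimal sketch of what such a gate could check before regenerating anything. This is not the library-go implementation; the function name and certificate path are hypothetical.

```go
// Illustrative sketch only: regenerate a cert only when it has actually expired.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// needsRegeneration reports whether the PEM-encoded certificate is expired
// (or unparseable, which we also treat as "regenerate").
func needsRegeneration(certPEM []byte, now time.Time) bool {
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return true
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return true
	}
	return now.After(cert.NotAfter)
}

func main() {
	// Hypothetical location; the real certs live in the static pod resource dirs.
	certPEM, err := os.ReadFile("/etc/kubernetes/static-pod-certs/serving-cert/tls.crt")
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	if needsRegeneration(certPEM, time.Now()) {
		fmt.Println("certificate expired, would regenerate")
	} else {
		fmt.Println("certificate still valid, nothing to do")
	}
}
```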
A contributor commented:

> only refresh on expired

I'd take that as a requirement. I'd not want the cert-regenerator to renew at any other time (e.g. if you're investigating operator logs, you should be able to reason about the cert flow, at least in normal circumstances).

so that it never collides with the operator during normal operation. The library-go cert rotation impl is resilient to
multiple actors already.
9. kcm-static-pod/cert-syncer sees updated certs and places them for reload. (this already works)
10. kcm-static-pod/kube-controller-manager wires up a library-go/pkg/controller/fileobserver to the CSR signer and suicides on the update
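
Step 10's behavior, exiting the process when the watched signer file changes, is roughly the pattern sketched below. The real wiring goes through library-go/pkg/controller/fileobserver; this is a simplified stand-in with a hypothetical file path, not that library's API.

```go
// Illustrative "exit when a watched file changes" pattern.
package main

import (
	"bytes"
	"log"
	"os"
	"time"
)

// exitOnChange polls path and terminates the process when its content
// changes, so the container restarts and picks up the new CSR signer.
func exitOnChange(path string, interval time.Duration) {
	initial, _ := os.ReadFile(path) // a missing file is treated as empty content
	for {
		time.Sleep(interval)
		current, _ := os.ReadFile(path)
		if !bytes.Equal(initial, current) {
			log.Printf("%s changed, exiting so the new signer is loaded", path)
			os.Exit(0)
		}
	}
}

func main() {
	// Hypothetical path for the CSR signer material.
	go exitOnChange("/etc/kubernetes/static-pod-certs/csr-signer/tls.crt", 10*time.Second)
	select {} // stand-in for the kube-controller-manager's real work
}
```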
A contributor commented:

Might be worth describing it up to the point where the operators start running again - the kube-scheduler would just see the new valid apiserver serving cert, the cert-syncer would sync the new client certs and pods would start being scheduled again, then the operators start coming up and the MCO will fix the kubelet serving certs for logs.

A contributor commented:

Hm, what approves the CSRs? This sounds close to the non-goals of this proposal, but we should at least have an overall vision so other teams can follow and fill in the blanks in their own proposals.

So if the kubelet client certs are expired, the kubelet sends a CSR to the apiserver and can't list/run pods until that CSR is approved, yet the approver (a cloud team pod) is not running as a static pod - it sounds like it should.
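
For context on the question above: whatever component ends up approving kubelet CSRs, the approval itself is an update through the certificates.k8s.io/v1 "approval" subresource. A rough client-go sketch follows (assuming a recent client-go; the reason/message strings and CSR name are made up). It is not the cloud team's actual approver.

```go
// Sketch of approving a pending CSR via the certificates.k8s.io/v1 API.
package main

import (
	"context"

	certv1 "k8s.io/api/certificates/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// approveCSR marks a pending CSR approved via the "approval" subresource.
func approveCSR(ctx context.Context, client kubernetes.Interface, name string) error {
	csr, err := client.CertificatesV1().CertificateSigningRequests().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	csr.Status.Conditions = append(csr.Status.Conditions, certv1.CertificateSigningRequestCondition{
		Type:    certv1.CertificateApproved,
		Status:  corev1.ConditionTrue,
		Reason:  "RecoveryApproved",                     // illustrative reason
		Message: "approved during certificate recovery", // illustrative message
	})
	_, err = client.CertificatesV1().CertificateSigningRequests().UpdateApproval(ctx, name, csr, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := approveCSR(context.Background(), client, "example-kubelet-csr"); err != nil {
		panic(err)
	}
}
```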

@soltysh left a comment:

Nits, but I think the overall idea is perfectly reasonable. I don't think at this point we should go too deep into the implementation details; we can always update the proposal as we go. In that case, I'd like to see this merged soon.

## Summary

Fully automate the recovery of kube-apiserver and kube-controller-manager certificates currently documented [here](https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html).
Currently, there are are helper commands to make the effort more practical, but we think we can fully automate the process
Review comment: "are are" - duplicated word.


We will take our existing `cluster-kube-apiserver-operator regenerated-certificates` command and create a simple, non-leader-elected
controller which will watch for expired certificates and regenerate them. It will connect to the kube-apiserver using
localhost with an SNI name option wired to a 10 year cert. When there is no work to do, this controller wil do nothing.
Review comment: s/wil/will
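
The hunk above (connect to the kube-apiserver over localhost with an SNI name wired to a 10-year cert) boils down to requesting a specific TLS server name when dialing. A minimal sketch with plain crypto/tls follows; the localhost-recovery name comes from this thread, the CA path is a hypothetical placeholder, and a real client would also present a client certificate.

```go
// Sketch of dialing the apiserver on localhost while requesting a specific
// SNI name, so the server presents its long-lived recovery serving cert.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hypothetical CA bundle path for the recovery serving cert.
	caPEM, err := os.ReadFile("/etc/kubernetes/static-pod-certs/localhost-recovery-ca/ca.crt")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		panic("no CA certificates parsed")
	}

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				// ServerName drives SNI: ask the apiserver for its long-lived
				// localhost-recovery serving certificate rather than the
				// default (possibly expired) one.
				ServerName: "localhost-recovery",
				RootCAs:    pool,
				// A real client would also set Certificates to authenticate.
			},
		},
	}
	resp, err := client.Get("https://localhost:6443/healthz")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("apiserver responded:", resp.Status)
}
```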

@tnozicka (Contributor) left a comment:

Approving it overall; a few nits to clarify.

Disaster recovery tests are still outstanding with an epic that may not be approved. Lack of testing here doesn't introduce
additional risk beyond that already accepted.

This will be tested as part of normal disaster recovery tests. It's built on already unit tested libraries and affects
A contributor commented:

> This will be tested as part of normal disaster recovery tests.

I am not sure what that means (the current DR tests are feature-targeted).

The author replied:

> This will be tested as part of normal disaster recovery tests. - I am not sure what that means (the current DR tests are feature-targeted)

and this proposal doesn't affect that one way or the other.


## Drawbacks

This process can instead be performed via a laborious and error-prone manual procedure that three existing teams have already had trouble with.
A contributor commented:

Sounds more like a bad alternative than a drawback :)


We will take our existing `cluster-kube-apiserver-operator regenerated-certificates` command and create a simple, non-leader-elected
controller which will watch for expired certificates and regenerate them. It will connect to the kube-apiserver using
localhost with an SNI name option wired to a 10 year cert. When there is no work to do, this controller wil do nothing.
A contributor commented:

It should mention whether this will be a separate pod or part of the operand.
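
Whichever way that question is resolved (separate pod or part of the operand), the controller itself can stay as simple as the proposal describes: no leader election, a periodic check, and nothing to do unless something is expired. A rough sketch follows; checkAndRegenerate is a hypothetical placeholder for the real regeneration logic.

```go
// Sketch of the regenerator's outer loop: a plain, non-leader-elected periodic check.
package main

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkAndRegenerate is a hypothetical placeholder for the real work:
// inspect the tracked certificates and regenerate only the expired ones.
func checkAndRegenerate() {
	log.Println("checking certificates; nothing to do unless something is expired")
}

func main() {
	stopCh := make(chan struct{})
	// No leader election: every instance just runs the loop; the underlying
	// cert rotation logic is stated to be resilient to multiple actors.
	wait.Until(checkAndRegenerate, time.Minute, stopCh)
}
```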

@tnozicka (Contributor) commented on Oct 9, 2019:

It might be worth mentioning in the proposal that this will replace the existing manual recovery process, and that we should drop the recovery apiserver and regenerate-certs once this is stable.

@soltysh left a comment:

/lgtm
We'll update as we go, this is good for starters.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 15, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, soltysh, tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit c55ee82 into openshift:master Oct 15, 2019