-
Notifications
You must be signed in to change notification settings - Fork 484
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
add cert auto-recovery for kube-apiserver and kube-controller-manager design #1
add cert auto-recovery for kube-apiserver and kube-controller-manager design #1
Conversation
/assign @derekwaynecarr @mfojtik /cc @tnozicka |
fb169e8
to
c0b31f4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(half pass, will continue tomorrow, but it seem good so far)
kube-apiserver/auto-cert-recovery.md
Outdated
1. kas-static-pod/kube-apiserver starts with expired certificates | ||
2. kas-static-pod/cert-syncer connects to localhost kube-apiserver with using a long-lived SNI cert (localhost-recovery). It sees expired certs. | ||
3. kas-static-poc/cert-regenerator connects to localhost kube-apiserver with a long-lived SNI cert (localhost-recovery). It sees expired certs and refreshes them as appropriate. Being in the same | ||
repo, it uses the same logic. We will probably add an overall option to the library-go cert rotation to say, "only refresh on expired" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only refresh on expired
I'd take that as a requirement. I'd not want the cert-regenerator to renew any other time. (e.g. if you investigating operator logs, you should be able to reason about cert flow at least in normal circumstances
kube-apiserver/auto-cert-recovery.md
Outdated
so that it never collides with the operator during normal operation. The library-go cert rotation impl is resilient to | ||
multiple actors already. | ||
9. kcm-static-pod/cert-syncer sees updated certs and places them for reload. (this already works) | ||
10. kcm-static-pod/kube-controller-manager wires up a library-go/pkg/controller/fileobserver to the CSR signer and suicides on the update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might worth to describe it up to a point where operators start running again - kube scheduler would just see the new valid apiserver serving cert, cert-syncer would sync the new client-certs and start scheduling pods, then operators start coming up and MCO will fix kubelet serving certs for logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, what approves the CSRs? sounds like this is close to Non-goals of this proposal but we should at least have an overall vision so other teams can follow and fill in the blanks in their own proposals.
so if kubelet client certs are expired - kubelet sends CSR request to apiserver, and it can't list/run pods until that's approved, yet the approver (a cloud team pod) is not running as static pod - sounds like it should
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nits, I think the overall idea is perfectly reasonable. I don't think at this point we should go too much in the details of implementation. We can always update the proposal as is. I'd like to see this merged soon, in that case.
kube-apiserver/auto-cert-recovery.md
Outdated
## Summary | ||
|
||
Fully automate the recovery of kube-apiserver and kube-controller-manager certificates currently documented [here](https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html). | ||
Currently, there are are helper commands to make the effort more practical, but we think we can fully automate the process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are are
kube-apiserver/auto-cert-recovery.md
Outdated
|
||
We will take our existing `cluster-kube-apiserver-operator regenerated-certificates` command and create a simple, non-leader-elected | ||
controller which will watch for expired certificates and regenerate them. It will connect to the kube-apiserver using | ||
localhost with an SNI name option wired to a 10 year cert. When there is no work to do, this controller wil do nothing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/wil/will
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approve it overall, few nits to clarify
kube-apiserver/auto-cert-recovery.md
Outdated
Disaster recovery tests are still outstanding with an epic that may not be approved. Lack of testing here doesn't introduce | ||
additional risk beyond that already accepted. | ||
|
||
This will be tested as part of normal disaster recovery tests. It's built on already unit tested libraries and affects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be tested as part of normal disaster recovery tests.
- I am not sure what it means (the current DR tests are feature targeted)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be tested as part of normal disaster recovery tests.
- I am not sure what it means (the current DR tests are feature targeted)
and this proposal doesn't effect that one way or the other.
kube-apiserver/auto-cert-recovery.md
Outdated
|
||
## Drawbacks | ||
|
||
This process can be run by a laborious and error prone manual process that three existent teams have already had trouble with. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sound more like a bad alternative then drawback :)
kube-apiserver/auto-cert-recovery.md
Outdated
|
||
We will take our existing `cluster-kube-apiserver-operator regenerated-certificates` command and create a simple, non-leader-elected | ||
controller which will watch for expired certificates and regenerate them. It will connect to the kube-apiserver using | ||
localhost with an SNI name option wired to a 10 year cert. When there is no work to do, this controller wil do nothing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should mention if this will be a separate pod or part of the operand
might worth to mention in the proposal that this will replace the existing manual recovery process and we should drop the recovery apiserver and regenerate-certs when this is stable |
c0b31f4
to
3e93e2c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
We'll update as we go, this is good for starters.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, soltysh, tnozicka The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Add option for cloud-operator
restructured to be more compliant with the enhancement template
Update to latest information
Updates to expand focus to other providers.
feat(annotation): Provides more help on this annotation
Fedosin's changes
cosa/cli-spec: add envVar CLI support
Add keps/core/decouple-logical-clusters-from-hierarchy.md
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2NO enhancement grammar pass and adding open questions.
Fully automate the recovery of kube-apiserver and kube-controller-manager certificates currently documented here.
Currently, there are are helper commands to make the effort more practical, but we think we can fully automate the process
to avoid human error and intervention.
This will make integrations from teams like code-ready-containers and training easier to build and reduce our support burden in recovery cases.
@openshift/sig-master