
add cert auto-recovery for kube-apiserver and kube-controller-manager design #1

Merged (2 commits) on Oct 15, 2019

Conversation


@deads2k (Contributor) commented on Aug 22, 2019:

Fully automate the recovery of kube-apiserver and kube-controller-manager certificates currently documented [here](https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html).
Currently, there are helper commands to make the effort more practical, but we think we can fully automate the process
to avoid human error and intervention.

This will make integrations from teams like code-ready-containers and training easier to build and reduce our support burden in recovery cases.

@openshift/sig-master

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 22, 2019
@deads2k (Contributor, Author) commented on Sep 19, 2019:

/assign @derekwaynecarr @mfojtik

/cc @tnozicka

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 20, 2019
@tnozicka (Contributor) left a comment:

(half pass, will continue tomorrow, but it seems good so far)

1. kas-static-pod/kube-apiserver starts with expired certificates
2. kas-static-pod/cert-syncer connects to the localhost kube-apiserver using a long-lived SNI cert (localhost-recovery). It sees expired certs.
3. kas-static-pod/cert-regenerator connects to the localhost kube-apiserver using a long-lived SNI cert (localhost-recovery). It sees expired certs and refreshes them as appropriate. Being in the same repo, it uses the same logic. We will probably add an overall option to the library-go cert rotation to say, "only refresh on expired"
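
As an aside on step 3's "only refresh on expired" idea, here is a minimal sketch of what such a gate could check before regenerating anything. This is not the library-go implementation; the function name and certificate path are hypothetical.

```go
// Illustrative sketch only: regenerate a cert only when it has actually expired.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
	"time"
)

// needsRegeneration reports whether the PEM-encoded certificate is expired
// (or unparseable, which we also treat as "regenerate").
func needsRegeneration(certPEM []byte, now time.Time) bool {
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return true
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return true
	}
	return now.After(cert.NotAfter)
}

func main() {
	// Hypothetical location; the real certs live in the static pod resource dirs.
	certPEM, err := os.ReadFile("/etc/kubernetes/static-pod-certs/serving-cert/tls.crt")
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	if needsRegeneration(certPEM, time.Now()) {
		fmt.Println("certificate expired, would regenerate")
	} else {
		fmt.Println("certificate still valid, nothing to do")
	}
}
```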
A contributor commented:

> only refresh on expired

I'd take that as a requirement. I'd not want the cert-regenerator to renew at any other time (e.g. if you're investigating operator logs, you should be able to reason about the cert flow, at least in normal circumstances).

so that it never collides with the operator during normal operation. The library-go cert rotation impl is resilient to
multiple actors already.
9. kcm-static-pod/cert-syncer sees updated certs and places them for reload. (this already works)
10. kcm-static-pod/kube-controller-manager wires up a library-go/pkg/controller/fileobserver to the CSR signer and suicides on the update
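
Step 10's behavior, exiting the process when the watched signer file changes, is roughly the pattern sketched below. The real wiring goes through library-go/pkg/controller/fileobserver; this is a simplified stand-in with a hypothetical file path, not that library's API.

```go
// Illustrative "exit when a watched file changes" pattern.
package main

import (
	"bytes"
	"log"
	"os"
	"time"
)

// exitOnChange polls path and terminates the process when its content
// changes, so the container restarts and picks up the new CSR signer.
func exitOnChange(path string, interval time.Duration) {
	initial, _ := os.ReadFile(path) // a missing file is treated as empty content
	for {
		time.Sleep(interval)
		current, _ := os.ReadFile(path)
		if !bytes.Equal(initial, current) {
			log.Printf("%s changed, exiting so the new signer is loaded", path)
			os.Exit(0)
		}
	}
}

func main() {
	// Hypothetical path for the CSR signer material.
	go exitOnChange("/etc/kubernetes/static-pod-certs/csr-signer/tls.crt", 10*time.Second)
	select {} // stand-in for the kube-controller-manager's real work
}
```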
A contributor commented:

Might be worth describing it up to the point where the operators start running again - the kube-scheduler would just see the new valid apiserver serving cert, the cert-syncer would sync the new client certs and pods would start being scheduled again, then the operators start coming up and the MCO will fix the kubelet serving certs for logs.

A contributor commented:

Hm, what approves the CSRs? This sounds close to the non-goals of this proposal, but we should at least have an overall vision so other teams can follow and fill in the blanks in their own proposals.

So if the kubelet client certs are expired, the kubelet sends a CSR to the apiserver and can't list/run pods until that CSR is approved, yet the approver (a cloud team pod) is not running as a static pod - it sounds like it should.
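
For context on the question above: whatever component ends up approving kubelet CSRs, the approval itself is an update through the certificates.k8s.io/v1 "approval" subresource. A rough client-go sketch follows (assuming a recent client-go; the reason/message strings and CSR name are made up). It is not the cloud team's actual approver.

```go
// Sketch of approving a pending CSR via the certificates.k8s.io/v1 API.
package main

import (
	"context"

	certv1 "k8s.io/api/certificates/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// approveCSR marks a pending CSR approved via the "approval" subresource.
func approveCSR(ctx context.Context, client kubernetes.Interface, name string) error {
	csr, err := client.CertificatesV1().CertificateSigningRequests().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	csr.Status.Conditions = append(csr.Status.Conditions, certv1.CertificateSigningRequestCondition{
		Type:    certv1.CertificateApproved,
		Status:  corev1.ConditionTrue,
		Reason:  "RecoveryApproved",                     // illustrative reason
		Message: "approved during certificate recovery", // illustrative message
	})
	_, err = client.CertificatesV1().CertificateSigningRequests().UpdateApproval(ctx, name, csr, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := approveCSR(context.Background(), client, "example-kubelet-csr"); err != nil {
		panic(err)
	}
}
```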

@soltysh left a comment:

Nits, but I think the overall idea is perfectly reasonable. I don't think at this point we should go too deep into the implementation details; we can always update the proposal as we go. In that case, I'd like to see this merged soon.

## Summary

Fully automate the recovery of kube-apiserver and kube-controller-manager certificates currently documented [here](https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html).
Currently, there are are helper commands to make the effort more practical, but we think we can fully automate the process
Review comment: "are are" - duplicated word.


We will take our existing `cluster-kube-apiserver-operator regenerated-certificates` command and create a simple, non-leader-elected
controller which will watch for expired certificates and regenerate them. It will connect to the kube-apiserver using
localhost with an SNI name option wired to a 10 year cert. When there is no work to do, this controller wil do nothing.
Review comment: s/wil/will
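
The hunk above (connect to the kube-apiserver over localhost with an SNI name wired to a 10-year cert) boils down to requesting a specific TLS server name when dialing. A minimal sketch with plain crypto/tls follows; the localhost-recovery name comes from this thread, the CA path is a hypothetical placeholder, and a real client would also present a client certificate.

```go
// Sketch of dialing the apiserver on localhost while requesting a specific
// SNI name, so the server presents its long-lived recovery serving cert.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Hypothetical CA bundle path for the recovery serving cert.
	caPEM, err := os.ReadFile("/etc/kubernetes/static-pod-certs/localhost-recovery-ca/ca.crt")
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		panic("no CA certificates parsed")
	}

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				// ServerName drives SNI: ask the apiserver for its long-lived
				// localhost-recovery serving certificate rather than the
				// default (possibly expired) one.
				ServerName: "localhost-recovery",
				RootCAs:    pool,
				// A real client would also set Certificates to authenticate.
			},
		},
	}
	resp, err := client.Get("https://localhost:6443/healthz")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("apiserver responded:", resp.Status)
}
```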

@tnozicka (Contributor) left a comment:

Approving it overall; a few nits to clarify.

Disaster recovery tests are still outstanding with an epic that may not be approved. Lack of testing here doesn't introduce
additional risk beyond that already accepted.

This will be tested as part of normal disaster recovery tests. It's built on already unit tested libraries and affects
A contributor commented:

> This will be tested as part of normal disaster recovery tests.

I am not sure what that means (the current DR tests are feature-targeted).

The author replied:

> This will be tested as part of normal disaster recovery tests. - I am not sure what that means (the current DR tests are feature-targeted)

and this proposal doesn't affect that one way or the other.


## Drawbacks

This process can instead be performed via a laborious and error-prone manual procedure that three existing teams have already had trouble with.
A contributor commented:

Sounds more like a bad alternative than a drawback :)


We will take our existing `cluster-kube-apiserver-operator regenerated-certificates` command and create a simple, non-leader-elected
controller which will watch for expired certificates and regenerate them. It will connect to the kube-apiserver using
localhost with an SNI name option wired to a 10 year cert. When there is no work to do, this controller wil do nothing.
A contributor commented:

It should mention whether this will be a separate pod or part of the operand.
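
Whichever way that question is resolved (separate pod or part of the operand), the controller itself can stay as simple as the proposal describes: no leader election, a periodic check, and nothing to do unless something is expired. A rough sketch follows; checkAndRegenerate is a hypothetical placeholder for the real regeneration logic.

```go
// Sketch of the regenerator's outer loop: a plain, non-leader-elected periodic check.
package main

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkAndRegenerate is a hypothetical placeholder for the real work:
// inspect the tracked certificates and regenerate only the expired ones.
func checkAndRegenerate() {
	log.Println("checking certificates; nothing to do unless something is expired")
}

func main() {
	stopCh := make(chan struct{})
	// No leader election: every instance just runs the loop; the underlying
	// cert rotation logic is stated to be resilient to multiple actors.
	wait.Until(checkAndRegenerate, time.Minute, stopCh)
}
```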

@tnozicka (Contributor) commented on Oct 9, 2019:

It might be worth mentioning in the proposal that this will replace the existing manual recovery process, and that we should drop the recovery apiserver and regenerate-certs once this is stable.

@soltysh left a comment:

/lgtm
We'll update as we go, this is good for starters.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 15, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, soltysh, tnozicka

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit c55ee82 into openshift:master Oct 15, 2019