Update Out-of-Tree Azure Cloud Provider KEP #2028

Merged
merged 1 commit into from
Sep 30, 2020
185 changes: 171 additions & 14 deletions keps/sig-cloud-provider/azure/20190125-out-of-tree-azure.md
@@ -17,7 +17,7 @@ approvers:
- "@jagosan"
editor: "@feiskyer"
creation-date: 2019-01-29
last-updated: 2020-01-18
last-updated: 2020-09-29
status: implementable
---

@@ -40,8 +40,17 @@ status: implementable
- [Design Details](#design-details)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Alpha -> Beta Graduation](#alpha---beta-graduation)
- [Beta -> GA Graduation](#beta---ga-graduation)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
- [Monitoring Requirements](#monitoring-requirements)
- [Dependencies](#dependencies)
- [Scalability](#scalability)
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Technical Leads are members of the Kubernetes Organization](#technical-leads-are-members-of-the-kubernetes-organization)
- [Subproject Leads](#subproject-leads)
@@ -50,18 +59,22 @@ status: implementable

## Release Signoff Checklist

- [X] k/enhancements issue in release milestone and linked to KEP (https://github.com/kubernetes/enhancements/issues/667)
- [X] KEP approvers have set the KEP status to `implementable`
- [X] Design details are appropriately documented
- [X] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] Graduation criteria is in place
- [X] "Implementation History" section is up-to-date for milestone
- [X] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://github.com/kubernetes/enhancements/issues/667).
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [x] (R) Graduation criteria is in place
- [x] (R) Production readiness review completed
- [x] Production readiness review approved
- [x] "Implementation History" section is up-to-date for milestone
- [x] User-facing documentation has been created in [kubernetes-sigs/cloud-provider-azure](https://kubernetes-sigs.github.io/cloud-provider-azure/)
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

## Summary

Build support for the out-of-tree Azure cloud provider. This involves a well-tested version of the cloud-controller-manager
that has feature parity to the kube-controller-manager.
Build support for the out-of-tree Azure cloud provider. This involves a well-tested version of the cloud-controller-manager that has feature parity with the kube-controller-manager.

## Motivation

@@ -124,7 +137,7 @@ cloud-provider-azure/

- The core of Azure cloud provider would be moved to [kubernetes-sigs/cloud-provider-azure](https://github.com/kubernetes-sigs/cloud-provider-azure).
- The storage drivers would be moved to [kubernetes-sigs/azuredisk-csi-driver](https://github.com/kubernetes-sigs/azuredisk-csi-driver) and [kubernetes-sigs/azurefile-csi-driver](https://github.com/kubernetes-sigs/azurefile-csi-driver).
- The credential provider is still under discussion on [kubernetes/cloud-provider#13](https://github.com/kubernetes/cloud-provider/issues/13).
- The credential provider is tracked by the out-of-tree credential provider [KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-cloud-provider/20191004-out-of-tree-credential-providers.md) and won't block the progress of this feature.

### Risks and Mitigation

@@ -162,13 +175,27 @@ See [report](https://testgrid.k8s.io/provider-azure-cloud-provider-azure) for mo

- Azure cloud controller manager is moving to GA
- Feature compatible with KCM
- Conformance tests are passed and published to testgrid
- Conformance tests are passed and published to [testgrid](https://testgrid.k8s.io/provider-azure-cloud-provider-azure)
- CSI drivers for AzureDisk/AzureFile are moving to GA
- Feature compatible with KCM
- Conformance tests are passed and published to testgrid
- Features implemented from CSI API SPEC
- Conformance tests are passed and published to [testgrid](https://testgrid.k8s.io/provider-azure-azuredisk-csi-driver)
- Azure credential provider is still supported in Kubelet
- Feature compatible with KCM
- Conformance tests are passed and published to testgrid
- Features implemented from CSI API SPEC
- Conformance tests are passed and published to [testgrid](https://testgrid.k8s.io/provider-azure-cloud-provider-azure)

#### Alpha -> Beta Graduation

- E2E tests have been added in [testgrid](https://testgrid.k8s.io/provider-azure-cloud-provider-azure)
- The same set of tests has passed with the out-of-tree projects
- All the features from in-tree implementations are still supported

#### Beta -> GA Graduation

- Code is decoupled from the in-tree cloud provider (e.g. it shouldn't vendor in-tree implementations directly)
- E2E tests run stably (e.g. no flaky tests)
- Upgrade tests and scalability tests have passed

### Upgrade / Downgrade Strategy

@@ -181,6 +208,136 @@ For each Kubernetes minor releases (e.g. v1.15.x), dedicated Azure cloud control
- The version matrix for Azure cloud controller manager would be documented on [kubernetes/cloud-provider-azure](https://github.com/kubernetes/cloud-provider-azure/blob/master/README.md#current-status).
- The version matrix for CSI drivers would be documented on [kubernetes-sigs/azuredisk-csi-driver](https://github.com/kubernetes-sigs/azuredisk-csi-driver#container-images--csi-compatibility) and [kubernetes-sigs/azurefile-csi-driver](https://github.com/kubernetes-sigs/azurefile-csi-driver#container-images--csi-compatibility).

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

_This section must be completed when targeting alpha to a release._

* **How can this feature be enabled / disabled in a live cluster?**
- [x] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: CSIMigrationAzureDisk and CSIMigrationAzureFile
- Components depending on the feature gate: kube-controller-manager and kubelet
- [x] Other
- Describe the mechanism: deploy cloud-controller-manager, cloud-node-manager and CSI drivers in the cluster.
- Will enabling / disabling the feature require downtime of the control
plane? `--cloud-provider=external` should be set for kube-controller-manager.
- Will enabling / disabling the feature require downtime or reprovisioning
of a node? `--cloud-provider=external` should be set for kubelet.
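
As a rough illustration of the enablement conditions listed above, the following sketch inspects one component's command-line arguments for `--cloud-provider=external` and the two feature gates. The helper names (`flag_value`, `parse_feature_gates`, `enablement_problems`) are hypothetical and not part of any Kubernetes tool. The sketch also assumes that a gate missing from `--feature-gates` counts as disabled, which may not match the default of a given release.

```python
# Feature gate names taken from the questionnaire answer above.
REQUIRED_GATES = ("CSIMigrationAzureDisk", "CSIMigrationAzureFile")


def flag_value(args, name):
    """Return the value of a `--name=value` or `--name value` flag, or None."""
    prefix = name + "="
    for i, arg in enumerate(args):
        if arg.startswith(prefix):
            return arg[len(prefix):]
        if arg == name and i + 1 < len(args):
            return args[i + 1]
    return None


def parse_feature_gates(value):
    """Parse `Gate1=true,Gate2=false` into a dict of booleans."""
    gates = {}
    for pair in (value or "").split(","):
        key, sep, state = pair.partition("=")
        if sep:
            gates[key.strip()] = state.strip().lower() == "true"
    return gates


def enablement_problems(args):
    """List what is missing from one component's arguments (hypothetical check)."""
    problems = []
    if flag_value(args, "--cloud-provider") != "external":
        problems.append("--cloud-provider is not external")
    gates = parse_feature_gates(flag_value(args, "--feature-gates"))
    for gate in REQUIRED_GATES:
        # Assumption: an unlisted gate is treated as disabled.
        if not gates.get(gate, False):
            problems.append(gate + " is not enabled")
    return problems
```

An empty list from `enablement_problems` means the arguments match the conditions described above. It says nothing about whether the cloud-controller-manager, cloud-node-manager and CSI drivers are actually deployed.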

* **Does enabling the feature change any default behavior?**

The default behaviors are still the same as before.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
the enablement)?**

Yes. Delete the cloud-controller-manager and cloud-node-manager, then change the `--cloud-provider`
option back to `azure`; the cluster would still work. CSI drivers should be kept so that CSI-provisioned PVCs keep working.

* **What happens if we reenable the feature if it was previously rolled back?**

It would still work as expected.

* **Are there any tests for feature enablement/disablement?**

E2E tests have already been added and results are published on testgrid.

### Rollout, Upgrade and Rollback Planning

_This section must be completed when targeting beta graduation to a release._

* **How can a rollout fail? Can it impact already running workloads?**

Wrong component configurations may cause the rollout to fail; running workloads won't be impacted.

* **What specific metrics should inform a rollback?**

Failing to create a LoadBalancer-typed service or an AzureDisk PVC indicates that the rollout needs to be rolled back.

* **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**

Manually changing the `--cloud-provider` option has been verified. For upgrade->downgrade,
the volumes provisioned by CSI drivers should continue to be managed by CSI drivers. They
cannot be migrated to in-tree drivers.

* **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
fields of API types, flags, etc.?**

In-tree AzureDisk/AzureFile drivers would be migrated to CSI drivers automatically.

### Monitoring Requirements

_This section must be completed when targeting beta graduation to a release._

* **How can an operator determine if the feature is in use by workloads?**

Operation specific metrics (e.g. LoadBalancer creation and route table update) have been added.

* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- [x] Metrics
- Metric names:
- cloudprovider_azure_op_duration_seconds
- cloudprovider_azure_api_request_errors
- cloudprovider_azure_api_request_throttled_count
- cloudprovider_azure_op_duration_seconds_bucket
- Components exposing the metric: cloud-controller-manager and CSI drivers

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

- 99.5% of read and write ARM requests in the last 5 minutes were successful
- LoadBalancer service requests in the last 5 minutes are served in 60 seconds @99th percentile
- Routes for new nodes in the last 5 minutes are served in 90 seconds @99th percentile
- Disk PVC attach requests in the last 5 minutes are served in 60 seconds @99th percentile
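
To make the SLO thresholds above concrete, here is a minimal sketch that checks one 5-minute window against them. It assumes the request counts and per-operation durations (in seconds) have already been extracted from the metrics listed earlier; how they are scraped is not shown. The operation keys (`loadbalancer`, `route`, `disk_attach`) and the function names are hypothetical, and the percentile uses the nearest-rank method, which may differ from what a monitoring system computes from histogram buckets.

```python
# 99th-percentile limits in seconds, taken from the SLO list above.
P99_LIMIT_SECONDS = {"loadbalancer": 60, "route": 90, "disk_attach": 60}


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    ordered = sorted(samples)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = max(1, -(-len(ordered) * 99 // 100))
    return ordered[rank - 1]


def slo_violations(total_requests, failed_requests, durations):
    """Return the names of the SLOs that one 5-minute window violates."""
    violations = []
    # 99.5% success means at most 5 failures per 1000 ARM requests.
    if failed_requests * 1000 > total_requests * 5:
        violations.append("arm_success_ratio")
    for operation, limit in P99_LIMIT_SECONDS.items():
        samples = durations.get(operation, [])
        # An operation with no samples in the window is not evaluated.
        if samples and p99(samples) > limit:
            violations.append(operation)
    return violations
```

For example, `slo_violations(1000, 5, {"route": [95]})` reports only `route`, because 5 failures in 1000 requests sits exactly at the 99.5% boundary.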

### Dependencies

_This section must be completed when targeting beta graduation to a release._

* **Does this feature depend on any specific services running in the cluster?**

CSI drivers for AzureDisk/AzureFile are required for the out-of-tree cloud provider,
and their plans have already been added in the designs above.

### Scalability

_For alpha, this section is encouraged: reviewers should consider these questions
and attempt to answer them._

_For beta, this section is required: reviewers must answer these questions._

_For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field._

* **Will enabling / using this feature result in any new API calls?**

Yes, CSI drivers for AzureDisk/AzureFile would be introduced.

* **Will enabling / using this feature result in introducing new API types?**

Yes, CSI drivers for AzureDisk/AzureFile would be introduced.

### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now, we leave it here.

_This section must be completed when targeting beta graduation to a release._

* **How does this feature react if the API server and/or etcd is unavailable?**

Same as before.

* **What are other known failure modes?**

Refer to <https://kubernetes-sigs.github.io/cloud-provider-azure/faq>.

* **What steps should be taken if SLOs are not being met to determine the problem?**

Check the debug logs of cloud-provider-azure, since detailed steps are logged at debug level.

## Implementation History

See [kubernetes/cloud-provider-azure#pulls](https://github.com/kubernetes/cloud-provider-azure/pulls?utf8=%E2%9C%93&q=+is%3Apr+), [kubernetes-sigs/azuredisk-csi-driver#pulls](https://github.com/kubernetes-sigs/azuredisk-csi-driver/pulls?utf8=%E2%9C%93&q=is%3Apr++) and [kubernetes-sigs/azurefile-csi-driver#pulls](https://github.com/kubernetes-sigs/azurefile-csi-driver/pulls?utf8=%E2%9C%93&q=is%3Apr++).