From 73bfea991d7edc2c5ffead39a8f44fdcb3a6ccdc Mon Sep 17 00:00:00 2001
From: elmiko <msm@opbstudios.com>
Date: Fri, 10 Jan 2025 16:57:31 -0500
Subject: [PATCH] add cluster api autoscaler integration enhancement

---
 ...-integration-with-openshift-cluster-api.md | 709 ++++++++++++++++++
 1 file changed, 709 insertions(+)
 create mode 100644 enhancements/cluster-api/cluster-autoscaler-integration-with-openshift-cluster-api.md

diff --git a/enhancements/cluster-api/cluster-autoscaler-integration-with-openshift-cluster-api.md b/enhancements/cluster-api/cluster-autoscaler-integration-with-openshift-cluster-api.md
new file mode 100644
index 0000000000..b6a7e51e3b
--- /dev/null
+++ b/enhancements/cluster-api/cluster-autoscaler-integration-with-openshift-cluster-api.md
@@ -0,0 +1,709 @@
+---
+title: cluster-autoscaler-integration-with-openshift-cluster-api
+authors:
+  - "@elmiko"
+reviewers:
+  - "@joelspeed"
+  - "@nrb"
+  - "@damdo"
+approvers:
+  - "@joelspeed"
+api-approvers:
+  - "@joelspeed"
+creation-date: 2025-01-16
+last-updated: 2025-01-16
+tracking-link:
+  - https://issues.redhat.com/browse/OCPCLOUD-2116
+see-also:
+  - https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/cluster-autoscaler-integration.md
+  - https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/cluster-autoscaler-operator.md
+  - https://github.com/openshift/enhancements/blob/master/enhancements/cluster-api/installing-cluster-api-components-in-ocp.md
+  - https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/converting-machine-api-to-cluster-api.md
+replaces: []
+superseded-by: []
+---
+
+# Cluster Autoscaler Integration with Cluster API
+
+## Summary
+
+The [Kubernetes Cluster Autoscaler][cas-repo] is a tool for automating provisioning
+of compute resources in OpenShift clusters. It has been
+[integrated in OpenShift][cas-ocp] since before the 4.X major release series.
+With the migration of OpenShift's machine management components from Machine API to
+Cluster API, there are several open questions that must be answered about how the
+Cluster Autoscaler will integrate with Cluster API on OpenShift. This enhancement
+describes how the Cluster Autoscaler, and its operator, will be modified to
+interface with Cluster API resources and namespaces on OpenShift.
+
+## Motivation
+
+### User Stories
+
+* As an OpenShift administrator, I want the Cluster Autoscaler to function
+  as expected whether I am using the Machine API or Cluster API interface so
+  that my users do not experience a loss of features or significant down time.
+
+* As an OpenShift administrator, I want to utilize Cluster API as a single point
+  of interaction when using the Cluster Autoscaler for monitoring infrastructure
+  activity so that I can reduce the complexity of my automation and observability tooling.
+
+* As an OpenShift user, I want to focus my attention on a single API for machine
+  management so that I do not have to learn multiple interfaces for the same functionality.
+
+* As an OpenShift developer, I want to ensure that all features of the Cluster
+  Autoscaler continue to work as expected on Cluster API so that our users do
+  not experience a regression of functionality.
+
+* As an OpenShift developer, I want to ensure that the Cluster Autoscaler has
+  a consistent use pattern as we migrate from Machine API to Cluster API so that
+  users are not confused by unexpected behaviors and workflows.
+
+### Goals
+
+* Enable Cluster Autoscaler to use Cluster API resources instead of
+  Machine API resources on OpenShift.
+* Enable Cluster Autoscaler Operator to recognize Cluster API resource
+  targets in addition to Machine API resources.
+* Ensure mirroring of autoscaler specific metadata from Machine API
+  resources to Cluster API resources.
+* Add a conditions slice to the MachineAutoscaler resource status field.
+
+### Non-Goals
+
+* Change the expected namespace of operation for the Cluster Autoscaler or
+  Cluster Autoscaler Operator.
+* Change the expected workflow for ClusterAutoscaler and MachineAutoscaler
+  custom resources.
+* Enable the Cluster Autoscaler to manipulate both Machine API and Cluster
+  API MachineSets.
+
+## Proposal
+
+Update the Cluster Autoscaler to recognize Cluster API resources instead of Machine API.
+This change would bring our version of the autoscaler closer to the upstream
+version and would allow us to drop some patches we are carrying. The Cluster
+API MachineSet sync controller will be updated to recognize when the
+Cluster Autoscaler has made a change to a Cluster API resource and then sync
+the change to the corresponding Machine API resource, regardless of which resource
+is authoritative.
+
+Update the Cluster Autoscaler Operator to be namespace aware. This requires
+changing the operator to recognize when it has a Machine API or Cluster API
+reference in a MachineAutoscaler, and then use the appropriate namespace to
+locate the resource. The Cluster API MachineSet sync controller will be updated
+to ensure that when the Cluster Autoscaler Operator adds the autoscaling
+annotations, they are copied to any related resources, regardless of which
+is authoritative.
+
+Update the Cluster API MachineSet sync controller to recognize the
+scale-from-zero annotations and copy them from the Machine API resources to the
+Cluster API resources. This ensures consistent data representation on both
+the authoritative and non-authoritative records.
+
+Leave the Cluster Autoscaler and Cluster Autoscaler Operator in the
+openshift-machine-api namespace. This will preserve the existing user experience for
+managing autoscaling.
+
+### Workflow Description
+
+**cluster administrator** is a human user responsible for managing MachineSet,
+ClusterAutoscaler, and MachineAutoscaler resources in a cluster.
+
+1. The cluster administrator creates a Cluster API MachineSet named "scaling-set-1"
+  intended for autoscaling.
+2. The cluster administrator creates a ClusterAutoscaler resource to deploy
+  the cluster autoscaler component in the cluster.
+3. The cluster administrator creates a MachineAutoscaler resource referencing the
+  "scaling-set-1" MachineSet in the `scaleTargetRef` field.
+4. The cluster administrator can see from logs and metadata on the MachineSet
+  that it is being evaluated by the cluster autoscaler for autoscaling.
+
+In general, the workflow for users should continue to meet the current expectations
+for cluster autoscaler functionality. The main change to the previous workflows
+is the inclusion of Cluster API MachineSets (i.e. MachineSets that have the
+resource group `cluster.x-k8s.io`) in MachineAutoscaler resources.
+
+### API Extensions
+
+This enhancement will require a change to the `status` field of the MachineAutoscaler
+API type. There is no expected change to the `spec` field, or the behavior that
+users expect.
+
+The Cluster Autoscaler Operator will be changed to include logic that can detect
+the API group for any MachineSet that is referenced in the `scaleTargetRef` field
+of MachineAutoscaler resources. The change will instruct the Operator to search
+for records in the `openshift-cluster-api` namespace for resources with the
+`cluster.x-k8s.io` group, and to search in the `openshift-machine-api` namespace
+for resources with the `machine.openshift.io` group.
+
+This change could be incorporated into the API by adding a field for namespace in
+the `scaleTargetRef` structure, but given that OpenShift only allows MachineSet
+resources to be in one of the two namespaces listed previously, this API change is
+not necessary.
+
+The `status` field of the MachineAutoscaler resource will gain a `conditions` field.
+This field will be used to express the status of each MachineAutoscaler as it
+pertains to normal operation and conflicts that might arise from user configuration
+choices.
+
+### Topology Considerations
+
+#### Hypershift / Hosted Control Planes
+
+The changes in this enhancement are focused on OpenShift standalone
+topologies. Hypershift, and the hosted control plane topologies based
+on it, use Cluster API in a different configuration than standalone. As
+such, these clusters have their Cluster Autoscaler configured in a
+manner that does not depend on the Cluster Autoscaler Operator.
+
+Please see the Hypershift enhancement on [Node lifecycle][hcp-nl] for more
+information about the Cluster Autoscaler and its relationship to hosted
+control plane topologies.
+
+#### Standalone Clusters
+
+The changes in this enhancement are specifically meant for standalone clusters.
+These changes are not meant for deployment to Hypershift and hosted control plane
+topologies as they do not use the Cluster Autoscaler Operator, and have a different
+configuration for the Cluster Autoscaler and Cluster API.
+
+#### Single-node Deployments or MicroShift
+
+The changes in this enhancement are not meant for single-node deployments. The
+Cluster Autoscaler and Cluster Autoscaler Operator are not deployed in single-node
+clusters, and as such this feature is not relevant to that topology.
+
+### Implementation Details/Notes/Constraints
+
+The implementation changes for this enhancement will be done in the Cluster
+Autoscaler, Cluster Autoscaler Operator, and Cluster API MachineSet Sync Controller
+projects. Most of the details are internal changes to how Kubernetes resource data
+is applied and updated. The user-facing changes will be confined to the MachineAutoscaler
+resource associated with the Cluster Autoscaler Operator.
+
+#### Cluster Autoscaler Changes
+
+To ensure proper operation and reduce the complexity of integration, the Cluster
+Autoscaler will be configured to interface with Cluster API MachineSet resources.
+This change is primarily a change to the running configuration of the Cluster Autoscaler.
+
+Due to historical work on the Cluster Autoscaler and the sibling relationship of the
+Machine API and Cluster API, the Cluster Autoscaler customization has been reduced to
+environment variables and a few carried patches to express the Machine API functionality.
+Reconfiguring the Cluster Autoscaler for Cluster API will mean that some of the carried
+patches can be removed and that the deployment configuration will change.
+
+#### MachineAutoscaler Spec Changes
+
+The main implementation detail of this enhancement that users will interact with is
+how the Cluster Autoscaler Operator will locate the MachineSet resources referenced by
+the MachineAutoscaler resource.
+
+Previously, only a Machine API MachineSet (i.e. a `MachineSet` kind in the
+`machine.openshift.io` API group) would be a valid target of the `.spec.scaleTargetRef`
+field. After this enhancement is implemented, users may specify either a Machine
+API MachineSet or a Cluster API MachineSet in the `.spec.scaleTargetRef` field.
+
+For example, assume a cluster has the following MachineSet resources:
+
+```yaml
+apiVersion: machine.openshift.io/v1beta1
+kind: MachineSet
+metadata:
+  labels:
+    machine.openshift.io/cluster-api-cluster: example-cluster
+  name: example-cluster-worker-somezone-1
+  namespace: openshift-machine-api
+spec:
+  ...
+status:
+  authoritativeAPI: MachineAPI
+```
+
+```yaml
+apiVersion: machine.openshift.io/v1beta1
+kind: MachineSet
+metadata:
+  labels:
+    machine.openshift.io/cluster-api-cluster: example-cluster
+  name: example-cluster-worker-somezone-2
+  namespace: openshift-machine-api
+spec:
+  ...
+status:
+  authoritativeAPI: ClusterAPI
+```
+
+A user could then create the following MachineAutoscaler resources:
+
+```yaml
+apiVersion: "autoscaling.openshift.io/v1beta1"
+kind: "MachineAutoscaler"
+metadata:
+  name: "worker-somezone-1"
+  namespace: "openshift-machine-api"
+spec:
+  minReplicas: 1
+  maxReplicas: 12
+  scaleTargetRef:
+    apiVersion: machine.openshift.io/v1beta1
+    kind: MachineSet
+    name: example-cluster-worker-somezone-1
+```
+
+```yaml
+apiVersion: "autoscaling.openshift.io/v1beta1"
+kind: "MachineAutoscaler"
+metadata:
+  name: "worker-somezone-2"
+  namespace: "openshift-machine-api"
+spec:
+  minReplicas: 1
+  maxReplicas: 12
+  scaleTargetRef:
+    apiVersion: cluster.x-k8s.io/v1beta1
+    kind: MachineSet
+    name: example-cluster-worker-somezone-2
+```
+
+Note that the MachineAutoscaler named "worker-somezone-1" is targeting a Machine API
+MachineSet while "worker-somezone-2" is targeting a Cluster API MachineSet. The
+Cluster Autoscaler Operator will know by the `apiVersion` field whether to look
+for the resource in the `openshift-machine-api` or `openshift-cluster-api` namespace
+respectively.
+
+To support existing user experiences and workflows, the user might also have created
+the "worker-somezone-2" MachineAutoscaler using the Machine API reference as follows:
+
+```yaml
+apiVersion: "autoscaling.openshift.io/v1beta1"
+kind: "MachineAutoscaler"
+metadata:
+  name: "worker-somezone-2"
+  namespace: "openshift-machine-api"
+spec:
+  minReplicas: 1
+  maxReplicas: 12
+  scaleTargetRef:
+    apiVersion: machine.openshift.io/v1beta1
+    kind: MachineSet
+    name: example-cluster-worker-somezone-2
+```
+
+The Cluster Autoscaler Operator will update the Machine API MachineSet resource, and
+then the MachineSet sync controller will sync the change to the Cluster API MachineSet
+resource. The sync controller will use the managed fields (i.e. `.metadata.managedFields`)
+of the specified MachineSet to determine if the Cluster Autoscaler Operator made
+changes to the annotations, and then replicate those appropriately. In this manner,
+a user might specify an authoritative or non-authoritative MachineSet in the
+`scaleTargetRef` and the sync controller will be able to properly mirror the
+changes by detecting that the Cluster Autoscaler Operator has authored the changes.
+
+#### MachineAutoscaler Status Changes
+
+Another implementation detail of this enhancement that users will interact with is the
+addition of Kubernetes conditions to the MachineAutoscaler status field
+(i.e. `.status.conditions`).
+
+Historically, the MachineAutoscaler resource has not contained information to aid users in
+determining the state and health of that scaling group. With this enhancement however, it
+is possible that users might encounter issues when interacting with non-authoritative
+MachineSet resources. To improve the user experience, a conditions field will be added
+to the MachineAutoscaler status field (i.e. `.status.conditions`).
+
+There will be a single condition to indicate the ready status of the MachineAutoscaler. Under
+normal operation this condition will appear similar to this example:
+
+```yaml
+status:
+  conditions:
+    - lastTransitionTime: 2025-01-01T00:00:00Z
+      message: MachineAutoscaler ready for autoscaling
+      reason: MachineAutoscalerReady
+      status: "True"
+      type: Ready
+```
+
+In a case where the user has specified the same MachineSet in multiple MachineAutoscaler resources,
+the condition would appear as follows:
+
+```yaml
+status:
+  conditions:
+    - lastTransitionTime: 2025-01-01T00:00:00Z
+      message: targetScaleRef has multiple MachineAutoscaler owners
+      reason: MachineAutoscalerDuplicateTargetScaleRef
+      status: "False"
+      type: Ready
+```
+
+In this manner, errors in user configuration will be exposed through the MachineAutoscaler
+resource to assist users in diagnosing error states.
+
+#### MachineSet Sync Controller Changes
+
+The behavior of the MachineSet sync controller is another focus of implementation detail,
+specifically its handling of the MachineSet annotations and the replicas field of the spec
+(i.e. `.spec.replicas`). As mentioned previously, the MachineSet sync controller will
+use the managed fields metadata to know who, or what, has updated a field. Based on the field,
+and the author of the update, the sync controller will either propagate the change
+or synchronize with the authoritative resource.
+
+The updates that the MachineSet sync controller will watch for fall into a few different
+categories:
+
+* `.metadata.annotations` changes will be synced from the authoritative resource to the
+  non-authoritative except in these cases:
+  * The Cluster Autoscaler Operator has added the minimum and maximum size annotations, and ownership
+    annotation to a record. If the sync controller sees an update to these annotations on a
+    non-authoritative resource originating from the Cluster Autoscaler Operator, it will copy
+    that change to the authoritative resource if no MachineAutoscaler is referencing the
+    authoritative resource.
+  * A provider MachineSet controller has added the scale from zero annotations to a
+    non-authoritative record. This occurs when the Cluster API resource is marked as
+    authoritative but the Machine API resource is updated by the provider MachineSet controller.
+    In these cases the scale from zero annotations will be copied to the non-authoritative
+    Cluster API resource. The data from the MachineSet controller is only applied to
+    Machine API resources currently.
+* `.spec.replicas` changes will be synced from the Cluster API MachineSet to the Machine
+  API MachineSet regardless of which is authoritative when the change originates from the
+  Cluster Autoscaler. As the Cluster Autoscaler will be configured to operate against Cluster
+  API resources only, there will be a need to identify when the Cluster Autoscaler has updated
+  a non-authoritative Cluster API resource so that the authoritative resource can be updated.
+  This will only occur when the sync controller observes an update to the replicas field from
+  the Cluster Autoscaler.
+
+### Risks and Mitigations
+
+One of the possible risks associated with this implementation is conflicting
+MachineAutoscaler resources, where a user has created two resources
+that reference both the authoritative and non-authoritative MachineSet. This
+could create a race condition where updating the minimum and maximum size
+values will lead to an inaccurate update to both MachineSets.
+
+To address the risk of possible race conditions on MachineAutoscalers we have a few
+options:
+
+* Only allow MachineAutoscaler target scale references to specify the authoritative MachineSet.
+  This would prevent users from using the non-authoritative references in their
+  MachineAutoscalers, but _may_ be perceived as a regression in user experience. Enabling this
+  would require the Cluster Autoscaler Operator to inspect the MachineAutoscaler target
+  references to ensure that only the authoritative resources are referenced.
+* Implement precedence rules within the Cluster Autoscaler Operator so that it can choose
+  the proper MachineAutoscaler resource based on the authoritative reference. This would
+  allow users to create multiple MachineAutoscalers which could reference the
+  authoritative and non-authoritative MachineSet resources without creating a race
+  condition on the MachineSet updates. Enabling this would require building the
+  precedence logic into the Cluster Autoscaler Operator, and updating the user
+  documentation to explain the precedence rules.
+
+#### Security
+
+This change will allow users to specify MachineSet references that will exist in two
+namespaces. Although the entry is not free form, and the reference will be validated
+by the Cluster Autoscaler Operator, the OpenShift Cloud Infrastructure team has
+reviewed this design for possible security vulnerabilities and considers the risk low.
+
+Before releasing this change, it will be reviewed by product security for final analysis.
+
+#### User Experience
+
+This enhancement describes a small change in the user experience with most of the
+feature work landing in the Cluster Autoscaler Operator. The Cloud Infrastructure team has
+designed this change to follow the current user experience without introducing
+new concepts for enabling autoscaling on any specific MachineSet resource.
+
+The addition of conditions to the status will improve the user experience by giving
+direct feedback on warnings and errors that may arise during operation. The Cloud
+Infrastructure team has designed this change to reduce confusion for users and to
+provide a clear path for triage.
+
+Outside of the Cluster Autoscaler Operator, users will need to understand that the
+Cluster Autoscaler is now focused on Cluster API MachineSets instead of Machine API
+MachineSets. This change will require users to understand the authoritative and
+mirroring mechanisms around the Cluster API migration in order to operate and
+maintain their clusters effectively.
+
+These changes should be clear for users, but to help mitigate the risk of misinterpretation,
+the user experience focused changes in this enhancement will be documented in the
+official OpenShift documentation.
+
+### Drawbacks
+
+A drawback of the approach defined in this enhancement is that it could cause
+confusion among users. The change in API behavior introduced in this enhancement
+will allow users to specify the MachineSet resource of their choice when
+enabling autoscaling. Due to the mirrored nature of the Machine API and Cluster API
+MachineSets, users may not immediately comprehend which resource they should
+specify, and what workflow is recommended by Red Hat.
+
+To reduce confusion, the changes described in this enhancement will be reviewed in
+the OpenShift product documentation, with examples to clarify the intended usage.
+A post on the OpenShift developer blog (https://developers.redhat.com/blog)
+describing this feature in greater depth could also accompany any posts about the
+Cluster API changes to provide additional informational material.
+
+Another approach to reducing confusion would be to allow only a single type of
+MachineSet resource (Machine API or Cluster API) to be specified as a target for
+autoscaling. This approach could work if the Cluster API resources are chosen as
+the target, but would represent a hard shift in the current MachineAutoscaler
+behavior and would require a conversion migration for all upgrades where cluster
+autoscaler is in use.
+
+As the behavior described in this enhancement is making an extension to current
+workflows, without changing the ingestion API, the drawbacks to implementing this
+enhancement are low. By contrast, to enable cluster autoscaling for Cluster API
+resources _without_ this enhancement will require a redesign of how we express
+autoscaling in OpenShift.
+
+## Open Questions
+
+1. Will we encounter race conditions when dealing with synchronizing the autoscaling
+  metadata between authoritative and non-authoritative resources with the Cluster
+  Autoscaler Operator and MachineSet sync controller?
+2. Will we want to remove MachineAutoscalers that reference Cluster API MachineSets
+  during a downgrade?
+
+## Test Plan
+
+The current autoscaler end to end tests will be expanded to cover the new Cluster
+API specific functionality. The business logic of our current cluster autoscaler
+tests will continue to have relevance; the primary addition will be testing the
+authoritative/non-authoritative mechanisms described in this enhancement.
+
+Unit tests will be added to the Cluster Autoscaler Operator and MachineSet sync
+controller to exercise the precedence and mirroring logic.
+
+## Graduation Criteria
+
+The foundational technology that this enhancement is based on (cluster autoscaling) is
+already considered to be "GA" in OpenShift. The changes described in this enhancement
+will be dependent on the addition of Cluster API to OpenShift; the required
+functionality is summarized in these enhancements:
+
+* [Cluster API Integration][capi-integ-enh]
+* [Installing Cluster API components in OpenShift][capi-inst-enh]
+* [Converting Machine API resources to Cluster API][capi-conv-enh]
+
+The features described in this enhancement will follow graduation of their parent
+enhancements, and will use a feature gate to ensure that the features are not
+released prematurely.
+
+### Dev Preview -> Tech Preview
+
+- Ability to use either Machine API or Cluster API MachineSets with MachineAutoscalers
+- Cluster Autoscaler uses only Cluster API MachineSet resources
+- End user documentation
+- Sufficient test coverage, end to end and unit
+- Gather feedback from users rather than just developers
+- Write symptoms-based alerts for the component(s)
+
+### Tech Preview -> GA
+
+- More testing (upgrade, downgrade, scale)
+- Available by default
+- Conduct load testing
+- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+### Removing a deprecated feature
+
+- Announce deprecation and support policy of the existing feature
+- Deprecate the feature
+
+## Upgrade / Downgrade Strategy
+
+For upgrades, the features described in this enhancement do not pose a risk for
+users nor do they require conversions. Any MachineAutoscaler resources currently
+referencing Machine API MachineSet resources will continue to work as expected
+after upgrade.
+
+Given that there will be no Cluster API MachineSet resources in the cluster prior
+to upgrade, there can be no MachineAutoscaler referencing a Cluster API MachineSet.
+This means that no upgrade conversion or migration is required for this scenario.
+
+For downgrades, the features described in this enhancement might cause some
+MachineAutoscalers to enter error states. If a MachineAutoscaler is referencing a
+Cluster API MachineSet before a downgrade, and is not removed, it will cause
+the Cluster Autoscaler Operator to report errors in its logs for those
+MachineAutoscaler resources. Additionally, these MachineAutoscaler resources
+will not be able to be updated as the calls to update the previous scale target
+may not be able to find the Cluster API MachineSet resource. In all other respects
+a downgrade will be benign as the Cluster API resources will no longer be
+reconciled by OpenShift operators, see
+[Upgrade/Downgrade Strategy of conversion enhancement][capi-conv-enh-ud].
+
+## Version Skew Strategy
+
+The changes described in this enhancement are limited to the MachineAutoscaler
+API resource, the Cluster Autoscaler Operator, the Cluster Autoscaler, and the
+MachineSet sync controller. As the Cluster Autoscaler Operator is the only controller
+that reconciles the MachineAutoscaler resource, a skew in versions with other
+OpenShift components represents a low risk for adverse behavior.
+
+During an upgrade, configured MachineAutoscalers will continue to work as expected.
+There is a possibility for a gap in autoscaling services when cluster conditions
+have a skew with the Cluster Autoscaler. In scenarios where the Cluster Autoscaler is
+skewed behind the version of the Cluster Autoscaler Operator, it is possible that
+Cluster API MachineSets will not be processed by the Cluster Autoscaler. This condition
+would not make the cluster unusable and would be cleared when the Cluster Autoscaler
+is updated.
+
+## Operational Aspects of API Extensions
+
+The Cluster Autoscaler Operator will continue to be deployed in the `openshift-machine-api`
+namespace. Previously it had only inspected resources (ClusterAutoscaler, MachineAutoscaler,
+and MachineSet) in the same namespace. After this enhancement is implemented the
+Cluster Autoscaler Operator will be able to inspect MachineSet resources in the
+`openshift-cluster-api` namespace as well. The decision of which namespace to use when
+inspecting records is determined by the API group version of the `scaleTargetRef`
+field in a MachineAutoscaler resource.
+
+Changes to the Cluster Autoscaler will not have a marked change in user experience. The
+expected workflows for operation, maintenance, and triage will continue to follow
+prescribed patterns for OpenShift. The key difference is that the Cluster Autoscaler
+will only integrate with Cluster API MachineSets after this enhancement is implemented.
+
+The failure modes around the changes in this enhancement will be focused on the behavior
+of the MachineAutoscaler API type and the Cluster Autoscaler. The most common failures
+will come from incorrectly configured MachineAutoscaler resources: either referencing a
+previously referenced MachineSet, or referencing a non-existent MachineSet. A less common
+failure could be possible through user intervention that would overwrite the autoscaling
+related API fields. In both of these cases, the effects on the cluster will be that some
+autoscaling related activities might not work as expected. These failure modes will not
+cause a disruption to holistic cluster usage.
+
+## Support Procedures
+
+The first point of failure when using the features described in this enhancement will be
+the MachineAutoscaler resource. During a failure scenario, the expected symptoms include:
+
+* The target MachineSet not being scaled by the cluster autoscaler.
+* The target MachineSet not having the minimum and maximum scale annotations (`cluster-api-autoscaler-node-group-min-size` and `cluster-api-autoscaler-node-group-max-size`).
+* The `Ready` condition on the MachineAutoscaler being set to `False`.
+
+These symptoms will not affect crucial cluster operation, but may impact cluster autoscaling
+capabilities for the MachineSets specified in the MachineAutoscaler resources.
+
+To triage issues related to the symptoms, users should follow these steps:
+
+1. Inspect the failing MachineAutoscaler's conditions in the status field. This will contain
+  a detailed message about the failure.
+1. Confirm the details of the `scaleTargetRef` field in the failing MachineAutoscaler.
+1. Search the logs for the Cluster Autoscaler Operator for warnings and errors related to the
+  failing MachineAutoscaler.
+1. Search the logs for the Cluster Autoscaler for warnings and errors related to the
+  MachineSet targeted by the MachineAutoscaler.
+
+It is safe to delete and recreate a failing MachineAutoscaler without disrupting cluster operation.
+
+## Alternatives
+
+### Cluster Autoscaler
+
+#### Recognize Both Machine API and Cluster API Resource Groups
+
+The Cluster Autoscaler could be updated to read both Machine API and Cluster API MachineSets.
+This would require adding a patch to the autoscaler that will carry the logic for reading and
+writing both API groups’ resources, and will be able to distinguish between the authoritative
+and non-authoritative resources. This alternative requires carrying a significantly complex
+patch on our fork of the autoscaler, although some of this code might be shared with the Cluster API
+MachineSet sync controller and could be extracted to a common library.
+
+This alternative has been dismissed as a possible implementation due to the complexity
+associated with carrying the extra patches on the Cluster Autoscaler, and the
+likelihood of race conditions with the MachineSet synchronization workflow.
+
+#### Only Recognize Machine API Resource Groups
+
+This is how we run the autoscaler today; it would require us to continue carrying our
+Machine API patches indefinitely. This change would keep the autoscaler in its current state
+and require the Cluster API MachineSet sync controller to recognize when the autoscaler has
+changed the Machine API resources and propagate them accordingly. This possibility has a
+large gap in that it will not be useful in situations where a Cluster API resource exists
+without a corresponding Machine API resource.
+
+This alternative has been dismissed as a possible implementation due to it not addressing
+user workflows that only include Cluster API MachineSets without corresponding Machine API
+MachineSets.
+
+### Cluster Autoscaler Operator
+
+#### No Change
+
+Currently the operator does not discriminate between the API groups and versions that are
+referenced in the MachineAutoscaler. This means that it should continue to function as
+expected regardless of whether it references Machine API or Cluster API resources. A
+downside to this approach is that the operator looks for resources in its own namespace
+(currently openshift-machine-api), and the Cluster API related resources will be in a
+different namespace. This means that there would need to be a Machine API MachineSet
+equivalent for any Cluster API MachineSet that a user might want to scale. The Cluster
+API MachineSet sync controller would need to be updated to ensure that when the Cluster
+Autoscaler Operator adds the autoscaling annotations, they are copied to any related
+resources, regardless of which is authoritative.
+
+This alternative has been dismissed as a possible implementation due to the limitation
+of requiring all resources to exist in the openshift-machine-api namespace. This
+limitation will prevent user workflows where only Cluster API MachineSets exist.
+
+#### Only Allow Machine API Targets
+
+The Cluster Autoscaler Operator could be updated to only allow the use of Machine API
+references as targets. This would mean restricting the functionality of the operator
+to only recognize Machine API resources. The Cluster API MachineSet sync controller
+would need to be updated to recognize the Machine API MachineSet changes and copy them
+into the Cluster API resources regardless of which resource is authoritative.
+
+This alternative has been dismissed as a possible implementation due to it not addressing
+user workflows that only include Cluster API MachineSets without corresponding Machine API
+MachineSets.
+
+#### Only Allow Cluster API Targets
+
+Similar to the previous alternative, the Cluster Autoscaler Operator could be updated
+to only allow the use of Cluster API references as targets. This alternative would
+require a similar level of change as the previous option, with the notable exceptions
+being the target of action and the need to allow the operator to view resources in the
+openshift-cluster-api namespace.
+
+This alternative has been dismissed as a possible implementation due to the regression
+in user experience that it would impose. This alternative would require the target
+reference of every MachineAutoscaler to be converted to the Cluster API version of its
+MachineSet for that MachineSet to remain included in autoscaling. This conversion could be done
+through automation, but the regression in user experience is considered high enough
+to dismiss this option.
+
+### Namespaces and Resources
+
+#### Migrate to the Cluster API Namespace
+
+Migrate the Cluster Autoscaler and Cluster Autoscaler Operator to the new
+openshift-cluster-api namespace. This change would require that we change the
+deployment artifacts for the Cluster Autoscaler and Cluster Autoscaler Operator to the
+new location. Because much automation is built on top of the current locations
+and namespaces, this possibility would require extensive documentation and migration
+information for users.
+
+This alternative has been dismissed as a possible implementation at this time due to
+the transitive toil it would create in adjusting all the build and test workflows. This
+may be reconsidered in the future but is not included in this enhancement.
+
+#### Migrate Autoscaling to Its Own Namespace
+
+Migrate the Cluster Autoscaler and Cluster Autoscaler Operator to their own namespace,
+for example `openshift-cluster-autoscaling`. This option requires a similar level of effort
+as moving to the openshift-cluster-api namespace but carries the possible advantage of
+allowing us to better separate distinct controllers in OpenShift.
+
+Similar to the previous alternative, this has been dismissed as a possible implementation
+at this time due to the transitive toil it would create in adjusting all the build and
+test workflows. This may be reconsidered in the future but is not included in this enhancement.
+
+
+[cas-repo]: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
+[cas-ocp]: https://docs.openshift.com/container-platform/4.17/machine_management/applying-autoscaling.html
+[hcp-nl]: https://github.com/openshift/enhancements/blob/master/enhancements/hypershift/node-lifecycle.md
+[capi-inst-enh]: https://github.com/openshift/enhancements/blob/master/enhancements/cluster-api/installing-cluster-api-components-in-ocp.md
+[capi-integ-enh]: https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/cluster-api-integration.md
+[capi-conv-enh]: https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/converting-machine-api-to-cluster-api.md
+[capi-conv-enh-ud]: https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/converting-machine-api-to-cluster-api.md#upgrade--downgrade-strategy