
[OSD-15261] CPMS: allow automatic vertical scaling. #1506

Conversation


@bergmannf bergmannf commented Oct 30, 2023

New proposal to allow the control-plane-machine-set operator to automatically scale control plane nodes.

This is based on OSD-15261 as an enhancement to automatically scale control plane nodes.

@openshift-ci openshift-ci bot requested review from celebdor and eparis October 30, 2023 08:29

openshift-ci bot commented Oct 30, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign frobware for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bergmannf bergmannf force-pushed the control-plane-machinset-vertical-scaling branch 5 times, most recently from c3d3d8d to 4ba2d3a Compare October 30, 2023 10:27
control plane nodes.

During SRE operations of the managed OpenShift product, a trigger was already
identified for when these increases are required. Instead of requiring the
Contributor

What triggers are we talking about? Are the scale up/down triggers documented in this EP?

Author

The basis for the underlying OSD card is these SREP alerts: https://github.com/openshift/managed-cluster-config/blob/master/deploy/sre-prometheus/100-control-plane-resizing.PrometheusRule.yaml
However, based on the Slack discussion those would depend on Prometheus alerts & metrics, so this tries to recapture (and make configurable) the basis of the alerts.
Thinking about more triggers, it might be a good idea to also use something like https://docs.openshift.com/container-platform/4.13/scalability_and_performance/recommended-performance-scale-practices/recommended-control-plane-practices.html as a 'trigger' - especially worker node counts.
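Purely as an illustration of what such a trigger could look like in the schema sketched in this proposal (the field layout loosely follows the cpu and prometheus trigger examples discussed further down in this review; the query, threshold, and window values are hypothetical and not taken from the SREP alert file):

    triggers:
      - type: prometheus
        prometheus:
          # Hypothetical query: average non-idle CPU fraction; the real SREP
          # alerts additionally restrict this to control plane nodes.
          query: '1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))'
          # Scale up once the value stays above this threshold for timeWindow.
          value: "0.8"
          timeWindow: "30m"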

case the operator should also be able to automatically decide to scale the
control plane down again.

With the automatic increase of the control plane size, cluster performance can
Contributor

Have we considered how to avoid flapping? So that we aren't in a constant loop of scale up/down as the capacity to usage ratios change during a rollout?

Author
@bergmannf bergmannf Oct 30, 2023

The idea is to prevent hasty scale up and down using the two configuration variables syncPeriod and gracePeriod.
syncPeriod is the minimum amount of time the average load must stay above the scale-up threshold (or below the scale-down threshold) before an actual change is performed.
gracePeriod then prevents a new scaling event from occurring if the last scale event was triggered more recently than the grace period.
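As a minimal sketch, assuming the field names proposed here and purely illustrative values:

    # Only act once the trigger condition has held continuously for syncPeriod,
    # and never act again within gracePeriod of the previous scale event.
    syncPeriod: "30m"
    gracePeriod: "2h"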

Contributor

@elmiko Can you add context on how the Cluster Autoscaler adds hysteresis, I think it would be good to align


### Goals

* Allow automatic vertical scaling of control plane nodes.
Contributor

As an alternative, have we investigated whether horizontal scaling would work? This strikes me as potentially easier from a scaling perspective, though we would need to understand the implications for the openshift control plane which, up to this point, has only supported 3 nodes

Author

This is based on our workflow as SREP, where we never scale horizontally to adjust for increased load but only vertically.
So we know that vertical scaling will accommodate increased cluster load, but I am not aware if we have data about horizontal scaling as a working alternative.

Contributor

Horizontal scaling should solve the issue. IIUC the issue tends to be that the API servers start getting hotter when there's an increased number of nodes/pods etc. Increasing the number of nodes horizontally would spread that load out and should therefore help.

That said, it's not currently supported by OpenShift, but I don't really know the history of why. It may be worth asking PM/eng in a wider scope about the gaps and why we don't support horizontally scaling the control plane.

Member

With etcd, scaling from 3 to 4 members raises the quorum requirement from 2 to 3 (because 2 of 4 isn't a strict majority in a split-network situation). So you can still only accept a single member failing, and now that you have four members that could each fail, the odds of having two down at any time are higher. So while you might handle read load better, your uptime will suffer, and folks generally consider uptime to be more important.

Scaling from 3 to 5 also raises the quorum from 2 to 3, but because you have five nodes, you can now lose two simultaneously while retaining quorum, which helps with uptime. But because Raft requires replicating to a majority of members before committing, the amount of etcd-to-etcd network traffic grows and write throughput decreases. Most folks want high write performance. And unless two member failures are very close together, the ControlPlaneMachineSet controller will be able to provision replacement members, and it's very rare to need the redundancy that a 5-member etcd cluster delivers.

However, that constraint is about etcd write speed, while a lot of cluster load is watches and lists and other reads. It may be possible to scale read-only etcd members, or to leave etcd at 3 but scale Kube API servers, or similar. But maybe there are reasons that wouldn't work. And even if it is possible to get something like that working, it would be a fairly large architecture pivot, while we have the CPMS tooling in place today, and only lack the bit that tweaks the instance type CPMS is being asked to roll out.
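The quorum arithmetic behind this can be written out explicitly (standard Raft majorities, nothing specific to this proposal):

    % quorum and fault tolerance for an etcd cluster of n voting members
    \mathrm{quorum}(n) = \left\lfloor n/2 \right\rfloor + 1, \qquad
    \mathrm{tolerance}(n) = n - \mathrm{quorum}(n)
    % n = 3: quorum 2, tolerates 1 failure
    % n = 4: quorum 3, still tolerates only 1 failure
    % n = 5: quorum 3, tolerates 2 failures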

Contributor

I think that's useful context @wking, and if there isn't an appetite within OCP to scale beyond 3 control plane nodes (now that I know the etcd operator can handle more than 3 replicas with learners), then it's a good reason to use vertical scaling. I just want to make sure this context gets into the doc, so that others who see horizontal scaling dismissed understand why.


### Non-Goals

* Horizontal scaling of control plane nodes
Contributor

I'd be curious, could you expand why this isn't a goal? Would horizontal scaling solve the problem or is it completely out of the question?

Author

This is based on our current practices which do not scale control planes horizontally.
From the control plane machine set documentation I also saw that horizontal scaling is not handled by the operator, so I'm not sure we have an operator that can scale control planes horizontally right now?
So using horizontal scaling might work, but my concerns are:

  • We seem to have nothing in place to handle it right now
  • I am not aware of having used it in production, so it seems uncertain if it would solve the same problems that vertical scaling already does.
    So it might help, but seems to have quite a few unknowns - maybe it could be added to 'alternatives' instead?

Contributor

Yeah I think that would be a case for an alternative.

I think horizontal would solve the problems here, but you're right, it's currently untested. But this does seem like a good chance to revisit whether it should become supported or not.


## Proposal

New configuration and a reconcile loop for the
Contributor

If this is going to be directly in the control plane machine set, then it's considered to be core functionality that we expect a significant number of users to leverage. Will this feature be built in a way that isn't environment specific and can be used on any vanilla openshift cluster without further configuration or additional operators being configured?

Any thought about whether this should be integrated into CPMS vs being a separate add-on?

Author
@bergmannf bergmannf Oct 30, 2023

I think having an add-on would be fine as well - in that case the default could be to not have it installed, so this whole feature would be opt-in.
SREP managed clusters could then opt in using our own automation that runs after cluster installation.

As an add-on would still require the complete controlplane-machine-set operator, I assume this could still be handled by this proposal?

Contributor

Yeah I think whether this is built in or not, having an EP for it makes sense.

CC @sub-mod for PM input, do you have any thoughts on the demand for this being built into the cluster vs being an add on?

Member

Will this feature be built in a way that isn't environment specific and can be used on any vanilla openshift cluster without further configuration...

It's hard for me to imagine us knowing enough to be able to default instancesSizes out of the box. But we ship the cluster autoscaler (MachineSets) and the horizontal pod autoscaler (deployments and such) baked into the OpenShift release payload. And we ship the custom metrics autoscaler (deployments and such) as a separate operator. So it seems like there is precedent both ways, and not a lot to decide between approaches. One benefit to a separate operator is that OLM packages can be uninstalled, while cluster capabilities cannot be. So if we expect some very resource-conscious consumers to not need this particular functionality, that might be enough of an argument to push it into a separate operator?

Contributor

I had a lengthy conversation with my team at standup yesterday about this topic, and we all came to the conclusion that having this initially as SD-owned, and then trying to deliver it via OLM for those who want to have it in regular openshift clusters, would be our preferred path forward.

I think that aligns with where others landed on this conversation too

- `instancesSizes`: a yaml object defining the available machine sizes and an
ordering to determine the order in which to size up and down.
- `instanceSizePath`: the path of the `instanceType` in the
`control-plane-machineset` that needs to be changed to initiate scaling. It
Contributor

There is only ever one control plane machine set per cluster

Author

This was based on the (wrong) assumption that only a single path in the spec needs to be updated.
As vSphere requires multiple fields I tried to rework how to configure this and this field becomes obsolete.

Contributor

You may want to look at how we handle failure domains in the control plane machine set, that solution would probably work well here

might be possible to either hardcode or autodetect this depending on the cloud
provider used.

The **can** configurations will include the following properties:
Contributor

Cc @elmiko how do these align with the cluster autoscaler?

[Node](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.27/#node-v1-core)
API.

As the window returned by the metrics API is likely shorter than the required window for
Contributor

This is based on polling the metrics API at some predefined (via user config?) interval, right?

Author

Exactly - that interval should be configurable.
In this proposal, the property that would control it is syncPeriod.


- Read access for `/apis/metrics.k8s.io/v1beta1/nodes`.
- Read access for `/api/v1/nodes` (this seems already in place for OpenShift,
but has to be verified for non-OpenShift installations).
Contributor

Non-OpenShift?

Author

I am not sure how we handle vanilla Kubernetes clusters in this proposal, so I included that point in this list.
If vanilla Kubernetes installations can be ignored, I can just remove this one.
However, in that case it might be better to say non-OSD installations, as that is what I checked against.

Contributor

Yeah, we don't tend to worry about non-OpenShift. There's a difference between OpenShift, OSD, HyperShift, and MicroShift to account for, but we can assume everything this proposal applies to will be some kind of OpenShift.

- Read access for `/api/v1/nodes` (this seems already in place for OpenShift,
but has to be verified for non-OpenShift installations).

### Open Questions
Contributor

Are there any additional metrics, other than those available from the metrics API, that a user may want to use to scale up?

Author
@bergmannf bergmannf Oct 30, 2023

What might be worth considering is taking into account our control plane sizing guidelines for OpenShift, which are based on the number of workers in the cluster: https://docs.openshift.com/container-platform/4.13/scalability_and_performance/recommended-performance-scale-practices/recommended-control-plane-practices.html

@bergmannf bergmannf force-pushed the control-plane-machinset-vertical-scaling branch from 4ba2d3a to 243eef5 Compare October 30, 2023 12:14
* Allow users to configure how long after a scale up or scale down, no more
scaling should be performed.
* Allow users to completely disable scale up.
* Allow users to completely disable scale down.
Member

In the interest of simplicity, these two seem like they might be covered by "Allow users to configure the sizes that can be chosen for automatic scaling.". If your control plane is running type-b, you could drop the instance type list from [type-a, type-b, type-c] to just type-b to pin the cluster at the current type. Or you could raise the "Allow users to configure how long after a scale up or scale down, no more scaling should be performed." threshold to infinity. Perhaps there are additional constraints, and these are "...without having to check to see what instance the control plane is currently using"? Or "...and disabling should pause any in-progress CPMS roll-outs"? Or perhaps these entries can be dropped, and the other existing goals are sufficient?
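For example, under that reading, and assuming instancesSizes ends up being an ordered list (its exact shape is still under discussion in this PR), pinning a cluster currently running type-b would just be (using the placeholder type names from the comment above):

    # Before: the autoscaler may move between any of these sizes.
    instancesSizes: [type-a, type-b, type-c]
    ---
    # After: only the current size is listed, so no scaling can happen.
    instancesSizes: [type-b]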

Author
@bergmannf bergmannf Oct 31, 2023

I was thinking of keeping these in for users that might, for example, never want to automatically scale down and only scale up.
I am not sure that edge case is enough to have this out of the box, because generally I like the idea of having fewer configuration variables.
Moved this to the selectPolicy field for now, with just "enabled" and "disabled" values for the start.

Contributor
@JoelSpeed JoelSpeed Oct 31, 2023

I agree that there is separate value in disabling scale up/scale down with its own lever. Having to remove the configuration of the different machine profiles to disable scale up/down feels like a bit of an awkward way to hold the tool, and it's not what we do in other autoscaling APIs to my knowledge

Author
@bergmannf bergmannf Oct 31, 2023

I moved the whole decision-making logic for scaling up and down to what the Custom Metrics Autoscaler calls triggers.
With this approach, specifying no triggers disables the respective option (scale up or scale down) completely.
I moved to the trigger approach because CMA also allows the metrics API and Prometheus as sources to trigger scaling, which is exactly what we want here as well. I tried to keep as many properties as possible the same, so users of one of the operators might have an easy time setting up the other.
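As a rough sketch of that shape (the cpu trigger fields come from the examples elsewhere in this review; the scaleUp/scaleDown nesting and the "omitted section means disabled" behaviour are the assumptions described above):

    scaleUp:
      triggers:
        - type: "cpu"
          cpu:
            value: "80"
            timeWindow: "30m"
    # No scaleDown section is specified, so no scale-down triggers exist and
    # automatic scale down is effectively disabled.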

#### Risk 1: Higher costs when scaling up

Scaling up will increase costs for running the cluster, as bigger nodes will
incur higher costs from the cloud provider.
Member

probably worth copying down "and mitigations" here, like explaining how instancesSizes sets a cap on the largest acceptable size. And possibly adding alerts around "you've pegged this autoscaler" (...and we're currently ok with that, but you may not be? ...and we currently wish we had a higher size available, because we're swamped! Other subclasses?).

- `gracePeriod`: how long after a successful scaling no more scaling should be
performed, even if the thresholds would scale the control planes again.
- `scaleDownEnabled`: can enable or disable scaling down.
- `scaleUpEnabled`: can enable or disable scaling up.
Member

The horizontal pod autoscaler has:

    scaleUp:
      selectPolicy: Disabled

which sets up a nice discriminated union pattern that allows convenient extensibility if later maintainers think of new schemas for either scale-up or scale-down.

Author

Thanks for the great link - I've also used the same pattern for the cloud provider specific machine configuration now.
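A rough sketch of what that discriminated union could look like for the provider-specific size list (the provider key and instance type names are placeholders, not part of the proposal):

    instancesSizes:
      type: AWS
      aws:
        # Ordered smallest to largest; the operator walks this list when
        # scaling up or down.
        - m6i.xlarge
        - m6i.2xlarge
        - m6i.4xlarge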

@bergmannf bergmannf force-pushed the control-plane-machinset-vertical-scaling branch 10 times, most recently from 08b9f15 to 24fc5ea Compare October 31, 2023 15:27
approvers:
- "@JoelSpeed"
api-approvers:
- None
Contributor

I expect this project needs an API based on some prior discussion, you can assign me as the API approver here

Comment on lines 24 to 26
To allow control plane nodes to remain at an adequate size for their cluster, this
proposal introduces new configuration for the control plane machine set operator
to allow it to make automated scaling decisions for control plane node sizing.
Contributor

Paragraph needs re-wording based on most recent conversation about having this as something separate that leverages CPMS.

I think we agreed that we would start with an SD-based operator and then later maybe put it on OLM?

Comment on lines 108 to 111
1. The OpenShift administrator creates a valid
`control-plane-machine-set-autoscaling` `CR` in the `openshift-machine-api`
namespace (or the respective namespace `control-plane-machineset-operator` is
running in), to configure the automatic vertical scaling.
Contributor

For consistency with MachineAutoscaler which already exists

Suggested change
1. The OpenShift administrator creates a valid
`control-plane-machine-set-autoscaling` `CR` in the `openshift-machine-api`
namespace (or the respective namespace `control-plane-machineset-operator` is
running in), to configure the automatic vertical scaling.
1. The OpenShift administrator creates a valid
`control-plane-machine-set-autoscaler` `CR` in the `openshift-machine-api`
namespace (or the respective namespace `control-plane-machineset-operator` is
running in), to configure the automatic vertical scaling.


### API Extensions

This proposal requires a new custom resource to configure when and how
Contributor

Do you want to define the API here? We do commonly review APIs within enhancements so that there's some sense of the UX of the proposal

Comment on lines +123 to +441
As every OpenShift cluster comes with the [metrics
API](https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/)
as well as
[prometheus](https://docs.openshift.com/container-platform/4.13/monitoring/monitoring-overview.html)
installed, both should be available as possible data sources.
Contributor

What about on clusters that don't have prometheus? I was under the impression that it's an optional capability right?

Author

I tried to make it clearer that if Prometheus isn't configured to be used as a trigger, the operator should work even if Prometheus isn't running, as it isn't supposed to use that datasource.

- `query`: Specifies the Prometheus query to use.
- `authModes`: Specifies the authentication method to use. Should support
at least *basic*, *bearer* and *tls* authentication.
- `ignoreNullValues`: false
Contributor

What does this do? Booleans are generally forbidden in openshift APIs in favour of using enums. Bools do not age well typically

- `authModes`: Specifies the authentication method to use. Should support
at least *basic*, *bearer* and *tls* authentication.
- `ignoreNullValues`: false
- `unsafeSsl`: Specifies whether the certificate check should be skipped.
Contributor

Is this definitely something we want to support? Typically the field name for this would be something like insecureSkipTLSVerify, might be nice to be consistent.

Shouldn't be a bool btw. Would prefer to see, if at all possible, the ability to specify a CA certificate to verify against rather than allowing to skip verification completely
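For instance, instead of a boolean, the trigger could reference a CA bundle to verify against; a minimal sketch, assuming a hypothetical field name (`caConfigMapRef` is not part of the proposal):

    prometheus:
      # Instead of `unsafeSsl: true`, point at a CA bundle used to verify the
      # Prometheus endpoint's serving certificate.
      caConfigMapRef:
        name: prometheus-serving-ca   # hypothetical ConfigMap name
        key: ca.crt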

Author

It's copied from the CMA operator (https://keda.sh/docs/2.12/scalers/prometheus/#authentication-parameters), but I can also adjust it if we are fine not having a 1:1 match to the fields in some cases.

Contributor

Yeah I don't think we need to match their API, looks like it hasn't been through API review so I would prefer we stick to OpenShift conventions rather than some third party project

Comment on lines 313 to 315
- type: "cpu"
value: "80"
timeWindow: "30m"
Contributor

This should be a discriminated union, so

      - type: "cpu"
        cpu:
          value: "80"
          timeWindow: "30m"

Comment on lines +335 to +688
- `secretTargetRef`:
- `parameter`: type of the secret referenced: should be `bearer` for bearer
authentication.
- `name`: name of the secret to use.
- `key`: key in the secret that contains the token.
Contributor

Why not include this directly? If you do include a separate CR for this (not an uncommon pattern, seen that before), I would at least expect that there is a reference to the authentication CR in the prometheus trigger struct

Author

I based this on https://keda.sh/docs/2.12/scalers/prometheus/#authentication-parameters (same as the trigger configuration) for consistency, but I can also move it directly into the trigger configuration.
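If it were inlined, a Prometheus trigger might look roughly like this (the auth field names are the ones quoted above from the proposal; the nesting, the example query, and the secret name are assumptions):

    - type: prometheus
      prometheus:
        query: 'sum(rate(apiserver_request_total[5m]))'
        authModes: "bearer"
        secretTargetRef:
          parameter: bearer
          name: prometheus-auth   # hypothetical secret name
          key: token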

Comment on lines 370 to 371
Both cases should use a conservative approach of not performing any actions, and
not interrupting cluster operation.
Contributor

Will you expect this to cause the operator to mark itself as degraded? How, as an end user, would I know that the scaling is not working as I expect it to?

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2024
@bergmannf
Author

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2024
New proposal to allow the control-plane-machine-set operator to
automatically scale control plane nodes.
Triggers are used in CMA to allow metrics API & prometheus as sources
for scaling decisions.

Most of the changes try to adapt the configuration from CMA, so users
don't have to get used to a completely new API if they already use CMA.
@bergmannf bergmannf force-pushed the control-plane-machinset-vertical-scaling branch from 9376930 to d61e83f Compare January 10, 2024 09:23
@bergmannf
Author

I've updated the doc by going through all comments again (I think I only missed one) - I'm wondering if there are any big things we still have to address, or if we can already start prototyping the operator on the SRE side?
I just don't want to get this prioritized before we are relatively stable on the basic idea / API.


openshift-ci bot commented Jan 18, 2024

@bergmannf: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@dhellmann
Contributor

#1555 is changing the enhancement template in a way that will cause the header check in the linter job to fail for existing PRs. If this PR is merged within the development period for 4.16 you may override the linter if the only failures are caused by issues with the headers (please make sure the markdown formatting is correct). If this PR is not merged before 4.16 development closes, please update the enhancement to conform to the new template.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 13, 2024
@JoelSpeed
Contributor

@bergmannf Is this still a priority? Do we need another round of reviews?

@bergmannf
Author

@JoelSpeed Thanks for circling back to this - right now our focus is simply on enabling CPMS across the managed fleet.
We do have an Epic that would rely on this proposal (https://issues.redhat.com/browse/SDE-3786) - however, I can't say when/if this will be prioritized.
So I think this PR is still relevant but not urgent - it would be great to have it ready, but I also don't want to make you spend your time reviewing it when we aren't sure when we will be able to get back to this.

@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 22, 2024
@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Mar 30, 2024

openshift-ci bot commented Mar 30, 2024

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann
Contributor

dhellmann commented Apr 5, 2024

(automated message) This pull request is closed with lifecycle/rotten. The associated Jira ticket, OSD-15261, has status "Closed, Obsolete". Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.

4 similar comments
