From ba3a63cde0fc62ec66fd2cd9885d9615544e50a9 Mon Sep 17 00:00:00 2001 From: Karthik K N Date: Thu, 13 Apr 2023 09:29:19 +0530 Subject: [PATCH 01/19] KEP for dynamic node resize --- .../3953-dynamic-node-resize/README.md | 732 ++++++++++++++++++ .../3953-dynamic-node-resize/kep.yaml | 33 + 2 files changed, 765 insertions(+) create mode 100644 keps/sig-node/3953-dynamic-node-resize/README.md create mode 100644 keps/sig-node/3953-dynamic-node-resize/kep.yaml diff --git a/keps/sig-node/3953-dynamic-node-resize/README.md b/keps/sig-node/3953-dynamic-node-resize/README.md new file mode 100644 index 00000000000..100d067f1c4 --- /dev/null +++ b/keps/sig-node/3953-dynamic-node-resize/README.md @@ -0,0 +1,732 @@ +# KEP-3953: Node dynamic resize + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation 
History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +This proposal aims at enabling dynamic node resizing. 
This will help in resizing cluster resource capacity by just updating resources of nodes rather than adding new node or removing existing node and +also enable node configurations to be reflected at the node and cluster levels automatically without the need to manually resetting the kubelet + +This proposal also aims to improvise the initialisation and reinitialisation of resource managers like cpu manager, memory manager with the dynamic change in machine's CPU and memory configurations. + +## Motivation +In a typical Kubernetes environment, the cluster resources may need to be altered because of various reasons like +- Incorrect resource assignment while creating a cluster. +- Workload on cluster is increased over time and leading to add more resources to cluster. +- Workload on cluster is decreased over time and leading to resources under utilization. + +To handle these scenarios currently we can +- Horizontally scale up or down cluster by the addition or removal of compute nodes +- Vertically scale up or down cluster by increasing or decreasing the node’s capacity, but the current workaround for the node resize to be captured by the cluster is only by the means of restarting Kubelet. + +The dynamic node resize will give advantages in case of scenarios like +- Handling the resource demand with limited set of machines by increasing the capacity of existing machines rather than creating new ones. +- Creating/Deleting new machine takes more time when compared to increasing/decreasing the capacity of existing ones. + +### Goals + +* Dynamically resize the node without restarting the kubelet +* Add ability to reinitialize resource managers(cpu manager, memory manager) to adopt changes in machine resource + + +### Non-Goals + +* Update the autoscaler to utilize dynamic node resize. + +## Proposal + +This KEP adds a polling mechanism in kubelet to fetch the machine-info using cadvisor, The information will be fetched repeatedly based on configured time interval. 
+Later node status updater will take care of updating this information at node level. + +This KEP also improvises the resource managers like memory manager, cpu manager initialization and reinitialization so that these resource managers will +adapt to the dynamic change in machine configurations. + +### User Stories (Optional) + +#### Story 1 + +As a cluster admin, I want to increase the cluster resource capacity without adding a new node to the cluster. + +#### Story 2 + +As a cluster admin, I want to decrease the cluster resource capacity without removing an existing node from the cluster. + +### Notes/Constraints/Caveats (Optional) + + + + + +### Risks and Mitigations + + + +## Design Details + +Below diagram is shows the interaction between kubelet and cadvisor + +``` ++----------+ +-----------+ +-----------+ +--------------+ +| | | | | | | | +| node | | kubelet | | cadvisor | | machine-info | +| | | | | | | | ++----+-----+ +-----+-----+ +-----+-----+ +-------+------+ + | | | | + | | poll | | + | |------------------------------>| | + | | | | + | | | | + | | | fetch | + | | |------------------------------->| + | | | | + | | | | + | | | | + | | | update | + | | |<-------------------------------| + | | | | + | | update | | + | |<------------------------------| | + | | | | + | | | | + | | | | + | node status update | | | + |<-------------------------------| | | + | | | | + | | | | + | re-run pod admission | | | + |<-------------------------------| | | | + | | | | + | re-initialize resource managers| | | + |<-------------------------------| | | | + | | | | + +``` + +The interaction sequence is as follows +1. Kubelet will be polling cadvisor with interval of configured time like one minute to fetch the machine resource information +2. Cadvisor will fetch and update the machine resource information +3. kubelet cache will be updated with the latest machine resource information +4. node status updater will update the node's status with new resource information +5. 
In case of shrink in cluster resources will re-run the pod admission to evict pods which lack resources +6. kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes + +Note: In case of increase in cluster resources scheduler will automatically schedule any pending pods + +### Test Plan + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name:DynamicNodeResize + - Components depending on the feature gate: kubelet +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? 
+ + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? 
+ +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-node/3953-dynamic-node-resize/kep.yaml b/keps/sig-node/3953-dynamic-node-resize/kep.yaml new file mode 100644 index 00000000000..bf291d4d897 --- /dev/null +++ b/keps/sig-node/3953-dynamic-node-resize/kep.yaml @@ -0,0 +1,33 @@ +title: Dynamic node resize +kep-number: 3953 +authors: + - "@Karthik-K-N" + - "@mkumatag" + - "@kishen-v" +owning-sig: sig-node +participating-sigs: + - sig-node +status: provisional +creation-date: 2023-10-04 +reviewers: + - "@smarterclayton" + - "@ffromani" + - "@SergeyKanzhelev" +approvers: + - "@sig-node-leads" +see-also: + +stage: "alpha" + +latest-milestone: "v1.28" + +milestone: + alpha: "" + beta: "" + stable: "" + +feature-gates: + - name: DynamicNodeResize + components: + - kubelet +disable-supported: true From 1501b40f3fbb08d2ea8f0e88b52bc0dcba536393 Mon Sep 17 00:00:00 2001 From: Kishen V Date: Thu, 13 Apr 2023 17:23:00 +0530 Subject: [PATCH 02/19] Fix KEP doc for node-resize. --- .../3953-dynamic-node-resize/README.md | 190 +++++++++++------- .../3953-dynamic-node-resize/kep.yaml | 6 - 2 files changed, 117 insertions(+), 79 deletions(-) diff --git a/keps/sig-node/3953-dynamic-node-resize/README.md b/keps/sig-node/3953-dynamic-node-resize/README.md index 100d067f1c4..779c2df2e8e 100644 --- a/keps/sig-node/3953-dynamic-node-resize/README.md +++ b/keps/sig-node/3953-dynamic-node-resize/README.md @@ -14,30 +14,30 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Release Signoff Checklist](#release-signoff-checklist) - [Summary](#summary) - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) + - [Goals](#goals) + - [Non-Goals](#non-goals) - [Proposal](#proposal) - - [User Stories (Optional)](#user-stories-optional) - - [Story 1](#story-1) - - [Story 2](#story-2) - - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - - [Risks and Mitigations](#risks-and-mitigations) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - - [Test Plan](#test-plan) - - [Prerequisite testing updates](#prerequisite-testing-updates) - - [Unit tests](#unit-tests) - - [Integration tests](#integration-tests) - - [e2e tests](#e2e-tests) - - [Graduation Criteria](#graduation-criteria) - - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - - [Version Skew Strategy](#version-skew-strategy) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - - [Feature Enablement and Rollback](#feature-enablement-and-rollback) - - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) - - [Monitoring Requirements](#monitoring-requirements) - - [Dependencies](#dependencies) - - [Scalability](#scalability) - - [Troubleshooting](#troubleshooting) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback 
Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) @@ -74,30 +74,29 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -This proposal aims at enabling dynamic node resizing. This will help in resizing cluster resource capacity by just updating resources of nodes rather than adding new node or removing existing node and -also enable node configurations to be reflected at the node and cluster levels automatically without the need to manually resetting the kubelet +The proposal aims at enabling dynamic node resizing. This will help in updating cluster resource capacity by just resizing compute resources of nodes rather than adding new node or removing existing node from a cluster. +The updated node configurations are to be reflected at the node and cluster levels automatically without the need to reset the kubelet. -This proposal also aims to improvise the initialisation and reinitialisation of resource managers like cpu manager, memory manager with the dynamic change in machine's CPU and memory configurations. +This proposal also aims to improve the initialization and reinitialization of resource managers, such as the CPU manager and memory manager, in response to changes in a node's CPU and memory configurations. ## Motivation -In a typical Kubernetes environment, the cluster resources may need to be altered because of various reasons like -- Incorrect resource assignment while creating a cluster. -- Workload on cluster is increased over time and leading to add more resources to cluster. -- Workload on cluster is decreased over time and leading to resources under utilization. 
+In a typical Kubernetes environment, the cluster resources may need to be altered due to following reasons: +- Incorrect resource assignment during cluster creation. +- Increased workload over time, leading to the need for additional resources in the cluster. +- Decreased workload over time, leading to resource underutilization in the cluster. -To handle these scenarios currently we can -- Horizontally scale up or down cluster by the addition or removal of compute nodes -- Vertically scale up or down cluster by increasing or decreasing the node’s capacity, but the current workaround for the node resize to be captured by the cluster is only by the means of restarting Kubelet. +To handle these scenarios, we can: +- Horizontally scale up or down the cluster by adding or removing compute nodes. +- Vertically scale up or down the cluster by increasing or decreasing node capacity. However, currently, the workaround for capturing node resizing in the cluster involves restarting the Kubelet. -The dynamic node resize will give advantages in case of scenarios like -- Handling the resource demand with limited set of machines by increasing the capacity of existing machines rather than creating new ones. -- Creating/Deleting new machine takes more time when compared to increasing/decreasing the capacity of existing ones. +Dynamic node resizing will provide advantages in scenarios such as: +- Handling resource demand with a limited set of nodes by increasing the capacity of existing nodes instead of creating new nodes. +- Creating or deleting new nodes takes more time compared to increasing or decreasing the capacity of existing nodes. ### Goals -* Dynamically resize the node without restarting the kubelet -* Add ability to reinitialize resource managers(cpu manager, memory manager) to adopt changes in machine resource - +* Dynamically resize the node without restarting the kubelet. 
+* Ability to reinitialize resource managers (CPU manager, memory manager) to adapt to changes in the node's resources.

### Non-Goals

@@ -105,21 +104,19 @@ The dynamic node resize will give advantages in case of scenarios like

## Proposal

-This KEP adds a polling mechanism in kubelet to fetch the machine-info using cadvisor, The information will be fetched repeatedly based on configured time interval.
-Later node status updater will take care of updating this information at node level.
+This KEP adds a polling mechanism in kubelet to fetch the machine-information from cAdvisor's cache. The information will be fetched periodically based on a configured time interval, after which the node status updater is responsible for updating this information at the node level in the cluster.

-This KEP also improvises the resource managers like memory manager, cpu manager initialization and reinitialization so that these resource managers will
-adapt to the dynamic change in machine configurations.
+Additionally, this KEP aims to improve the initialization and reinitialization of resource managers, such as the memory manager and CPU manager, so that they can adapt to changes in the node's configuration.

### User Stories (Optional)

#### Story 1

-As a cluster admin, I want to increase the cluster resource capacity without adding a new node to the cluster.
+As a cluster admin, I must be able to increase the cluster resource capacity without adding a new node to the cluster.

#### Story 2

-As a cluster admin, I want to decrease the cluster resource capacity without removing an existing node from the cluster.
+As a cluster admin, I must be able to decrease the cluster resource capacity without removing an existing node from the cluster.

### Notes/Constraints/Caveats (Optional)

@@ -148,13 +145,13 @@ Consider including folks who also work outside the SIG or subproject.
## Design Details -Below diagram is shows the interaction between kubelet and cadvisor +Below diagram is shows the interaction between kubelet and cAdvisor. ``` +----------+ +-----------+ +-----------+ +--------------+ | | | | | | | | -| node | | kubelet | | cadvisor | | machine-info | -| | | | | | | | +| node | | kubelet | | cAdvisor | | machine-info | +| | | | | cache | | | +----+-----+ +-----+-----+ +-----+-----+ +-------+------+ | | | | | | poll | | @@ -177,7 +174,7 @@ Below diagram is shows the interaction between kubelet and cadvisor | node status update | | | |<-------------------------------| | | | | | | - | | | | + | if shrink in resource | | | | re-run pod admission | | | |<-------------------------------| | | | | | | | @@ -188,14 +185,76 @@ Below diagram is shows the interaction between kubelet and cadvisor ``` The interaction sequence is as follows -1. Kubelet will be polling cadvisor with interval of configured time like one minute to fetch the machine resource information -2. Cadvisor will fetch and update the machine resource information -3. kubelet cache will be updated with the latest machine resource information -4. node status updater will update the node's status with new resource information -5. In case of shrink in cluster resources will re-run the pod admission to evict pods which lack resources -6. kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes +1. Kubelet will be polling in interval of configured time to fetch the machine resource information from cAdvisor's cache, Which is currently updated every 5 minutes. +3. Kubelet's cache will be updated with the latest machine resource information. +4. Node status updater will update the node's status with the latest resource information. +5. In case of a shrink in cluster resources rerun the pod admission and the pod admission will evict pods +6. 
Kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes.
+
+Note: In case of increase in cluster resources, the scheduler will automatically schedule any pending pods.
+
+**Kubelet Configuration changes**
+
+* A new boolean variable `dynamicNodeResize` will be added to kubelet configuration.
+* `dynamicNodeResize` will be false by default.
+* User need to set `dynamicNodeResize` to true make use of Dynamic Node Resize.
+
+**Proposed Code changes**
+
+**Dynamic Node resize and Pod Re-admission logic**
+
+```azure
+    if kl.kubeletConfiguration.DynamicNodeResize {
+        // Handle the node dynamic resize
+        machineInfo, err := kl.cadvisor.MachineInfo()
+        if err != nil {
+            klog.ErrorS(err, "Error fetching machine info")
+        } else {
+            cachedMachineInfo, _ := kl.GetCachedMachineInfo()
+
+            if !reflect.DeepEqual(cachedMachineInfo, machineInfo) {
+                kl.setCachedMachineInfo(machineInfo)
+
+                // Resync the resource managers
+                if err := kl.ResyncComponents(machineInfo); err != nil {
+                    klog.ErrorS(err, "Error resyncing the kubelet components with machine info")
+                }
+
+                //Rerun pod admission only in case of shrink in cluster resources
+                if machineInfo.NumCores < cachedMachineInfo.NumCores || machineInfo.MemoryCapacity < cachedMachineInfo.MemoryCapacity {
+                    klog.InfoS("Observed shrink in nod resources, rerunning pod admission")
+                    kl.HandlePodAdditions(activePods)
+                }
+            }
+        }
+    }
+```
+
+**Changes to resource managers to adapt to dynamic resize**
+
+1. Adding ResyncComponents() method to ContainerManager interface
+```azure
+    // Manages the containers running on a machine.
+    type ContainerManager interface {
+    .
+    .
+    // ResyncComponents will resync the resource managers like cpu, memory and topology managers
+    // with updated machineInfo
+    ResyncComponents(machineInfo *cadvisorapi.MachineInfo) error
+    .
+    .
+    )
+```
+
+2. Adding a Sync method to all the resource managers, to be invoked whenever there is a dynamic resource change.
+ +```azure + // Sync will sync the CPU Manager with the latest machine info + Sync(machineInfo *cadvisorapi.MachineInfo) error +``` + -Note: In case of increase in cluster resources scheduler will automatically schedule any pending pods +Note: PoC code changes: https://github.com/kubernetes/kubernetes/pull/115755 ### Test Plan @@ -212,26 +271,11 @@ implementing this enhancement to ensure the enhancements have also solid foundat ##### Unit tests - - - +1. Add necessary tests in kubelet_node_status_test.go to check for the node status behaviour with dynamic node resize. +2. Add necessary tests in kubelet_pods_test.go to check for the pod cleanup and pod addition workflow. +3. Add necessary tests in eventhandlers_test.go to check for scheduler behaviour with dynamic node capacity change. +4. Add necessary tests in resource managers to check for managers behaviour to adopt dynamic node capacity change. -- ``: `` - `` ##### Integration tests diff --git a/keps/sig-node/3953-dynamic-node-resize/kep.yaml b/keps/sig-node/3953-dynamic-node-resize/kep.yaml index bf291d4d897..edd968e005e 100644 --- a/keps/sig-node/3953-dynamic-node-resize/kep.yaml +++ b/keps/sig-node/3953-dynamic-node-resize/kep.yaml @@ -25,9 +25,3 @@ milestone: alpha: "" beta: "" stable: "" - -feature-gates: - - name: DynamicNodeResize - components: - - kubelet -disable-supported: true From 16af5dc2eea1716a47417ac1edbd5ba1b3c876aa Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Tue, 10 Sep 2024 16:28:12 +0530 Subject: [PATCH 03/19] Update to emphasis on scale up of resoures --- .../3953-dynamic-node-resize/README.md | 91 ++++++------------- .../3953-dynamic-node-resize/kep.yaml | 2 +- 2 files changed, 31 insertions(+), 62 deletions(-) diff --git a/keps/sig-node/3953-dynamic-node-resize/README.md b/keps/sig-node/3953-dynamic-node-resize/README.md index 779c2df2e8e..678d69821dc 100644 --- a/keps/sig-node/3953-dynamic-node-resize/README.md +++ b/keps/sig-node/3953-dynamic-node-resize/README.md @@ -74,7 +74,7 
@@ Items marked with (R) are required *prior to targeting to a milestone / release*

## Summary

-The proposal aims at enabling dynamic node resizing. This will help in updating cluster resource capacity by just resizing compute resources of nodes rather than adding new node or removing existing node from a cluster.
+The proposal aims at enabling dynamic node resizing. This will help in updating cluster resource capacity by just resizing compute resources of nodes rather than adding a new node to a cluster.
The updated node configurations are to be reflected at the node and cluster levels automatically without the need to reset the kubelet.

This proposal also aims to improve the initialization and reinitialization of resource managers, such as the CPU manager and memory manager, in response to changes in a node's CPU and memory configurations.
@@ -83,15 +83,14 @@ This proposal also aims to improve the initialization and reinitialization of re
In a typical Kubernetes environment, the cluster resources may need to be altered due to following reasons:
- Incorrect resource assignment during cluster creation.
- Increased workload over time, leading to the need for additional resources in the cluster.
-- Decreased workload over time, leading to resource underutilization in the cluster.

To handle these scenarios, we can:
-- Horizontally scale up or down the cluster by adding or removing compute nodes.
-- Vertically scale up or down the cluster by increasing or decreasing node capacity. However, currently, the workaround for capturing node resizing in the cluster involves restarting the Kubelet.
+- Horizontally scale up the cluster by adding compute nodes.
+- Vertically scale up the cluster by increasing node capacity. However, currently, the workaround for capturing node resizing in the cluster involves restarting the Kubelet.
Dynamic node resizing will provide advantages in scenarios such as:
- Handling resource demand with a limited set of nodes by increasing the capacity of existing nodes instead of creating new nodes.
-- Creating or deleting new nodes takes more time compared to increasing or decreasing the capacity of existing nodes.
+- Creating new nodes takes more time compared to increasing the capacity of existing nodes.

### Goals

@@ -101,9 +100,13 @@ Dynamic node resizing will provide advantages in scenarios such as:

### Non-Goals

* Update the autoscaler to utilize dynamic node resize.
+* Dynamically adjust system reserved and kube reserved values.

## Proposal

+This KEP aims to support dynamic resizing of a node's compute resources, focusing on dynamic scale-up of resources.
+Dynamic scale-down of resources will be proposed in a separate KEP in the future.
+
This KEP adds a polling mechanism in kubelet to fetch the machine-information from cAdvisor's cache. The information will be fetched periodically based on a configured time interval, after which the node status updater is responsible for updating this information at the node level in the cluster.

Additionally, this KEP aims to improve the initialization and reinitialization of resource managers, such as the memory manager and CPU manager, so that they can adapt to changes in the node's configuration.

### User Stories (Optional)

#### Story 1

As a cluster admin, I must be able to increase the cluster resource capacity without adding a new node to the cluster.

-#### Story 2
-
-As a cluster admin, I must be able to decrease the cluster resource capacity without removing an existing node from the cluster.
-
### Notes/Constraints/Caveats (Optional)

@@ -145,66 +144,42 @@ Consider including folks who also work outside the SIG or subproject.

## Design Details

+
The diagram below shows the interaction between kubelet and cAdvisor.
-``` -+----------+ +-----------+ +-----------+ +--------------+ -| | | | | | | | -| node | | kubelet | | cAdvisor | | machine-info | -| | | | | cache | | | -+----+-----+ +-----+-----+ +-----+-----+ +-------+------+ - | | | | - | | poll | | - | |------------------------------>| | - | | | | - | | | | - | | | fetch | - | | |------------------------------->| - | | | | - | | | | - | | | | - | | | update | - | | |<-------------------------------| - | | | | - | | update | | - | |<------------------------------| | - | | | | - | | | | - | | | | - | node status update | | | - |<-------------------------------| | | - | | | | - | if shrink in resource | | | - | re-run pod admission | | | - |<-------------------------------| | | | - | | | | - | re-initialize resource managers| | | - |<-------------------------------| | | | - | | | | +```mermaid +sequenceDiagram + participant node + participant kubelet + participant cAdvisor-cache + participant machine-info + kubelet->>cAdvisor-cache: Poll + cAdvisor-cache->>machine-info: fetch + machine-info->>cAdvisor-cache: update + cAdvisor-cache->>kubelet: update + kubelet->>node: node status update + kubelet->>node: re-initialize resource managers ``` The interaction sequence is as follows -1. Kubelet will be polling in interval of configured time to fetch the machine resource information from cAdvisor's cache, Which is currently updated every 5 minutes. -3. Kubelet's cache will be updated with the latest machine resource information. -4. Node status updater will update the node's status with the latest resource information. -5. In case of a shrink in cluster resources rerun the pod admission and the pod admission will evict pods -6. Kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes. +1. Kubelet will be polling in interval to fetch the machine resource information from cAdvisor's cache, Which is currently updated every 5 minutes. +2. 
Kubelet's cache will be updated with the latest machine resource information. +3. Node status updater will update the node's status with the latest resource information. +4. Kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes. Note: In case of increase in cluster resources, the scheduler will automatically schedule any pending pods. **Kubelet Configuration changes** -* A new boolean variable `dynamicNodeResize` will be added to kubelet configuration. -* `dynamicNodeResize` will be false by default. -* User need to set `dynamicNodeResize` to true make use of Dynamic Node Resize. +* Add a variable to configure the interval to fetch the updated machine information. **Proposed Code changes** **Dynamic Node resize and Pod Re-admission logic** -```azure - if kl.kubeletConfiguration.DynamicNodeResize { +```go + if utilfeature.DefaultFeatureGate.Enabled(features.DynamicNodeResize) { // Handle the node dynamic resize machineInfo, err := kl.cadvisor.MachineInfo() if err != nil { @@ -219,12 +194,6 @@ Note: In case of increase in cluster resources, the scheduler will automatically if err := kl.ResyncComponents(machineInfo); err != nil { klog.ErrorS(err, "Error resyncing the kubelet components with machine info") } - - //Rerun pod admission only in case of shrink in cluster resources - if machineInfo.NumCores < cachedMachineInfo.NumCores || machineInfo.MemoryCapacity < cachedMachineInfo.MemoryCapacity { - klog.InfoS("Observed shrink in nod resources, rerunning pod admission") - kl.HandlePodAdditions(activePods) - } } } } @@ -233,7 +202,7 @@ Note: In case of increase in cluster resources, the scheduler will automatically **Changes to resource managers to adapt to dynamic resize** 1. Adding ResyncComponents() method to ContainerManager interface -```azure +```go // Manages the containers running on a machine. type ContainerManager interface { . 
@@ -248,7 +217,7 @@ Note: In case of increase in cluster resources, the scheduler will automatically
2. Adding a method Sync to all the resource managers and will be invoked once there is dynamic resource change.
-```azure
+```go
// Sync will sync the CPU Manager with the latest machine info
Sync(machineInfo *cadvisorapi.MachineInfo) error
```
diff --git a/keps/sig-node/3953-dynamic-node-resize/kep.yaml b/keps/sig-node/3953-dynamic-node-resize/kep.yaml
index edd968e005e..ea02aca8620 100644
--- a/keps/sig-node/3953-dynamic-node-resize/kep.yaml
+++ b/keps/sig-node/3953-dynamic-node-resize/kep.yaml
@@ -19,7 +19,7 @@ see-also:
stage: "alpha"
-latest-milestone: "v1.28"
+latest-milestone: "v1.32"
milestone:
alpha: ""
From 6f55c96b6295c3c1ce88345ecf2af23c79155413 Mon Sep 17 00:00:00 2001
From: Karthik Bhat
Date: Mon, 13 Jan 2025 19:46:43 +0530
Subject: [PATCH 04/19] Rename the KEP to match the updated scope
--- .../README.md | 286 ++++++------------ .../kep.yaml | 5 +- 2 files changed, 103 insertions(+), 188 deletions(-) rename keps/sig-node/{3953-dynamic-node-resize => 3953-node-resource-hot-plug}/README.md (74%) rename keps/sig-node/{3953-dynamic-node-resize => 3953-node-resource-hot-plug}/kep.yaml (82%)
diff --git a/keps/sig-node/3953-dynamic-node-resize/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md
similarity index 74%
rename from keps/sig-node/3953-dynamic-node-resize/README.md
rename to keps/sig-node/3953-node-resource-hot-plug/README.md
index 678d69821dc..44e2fc07ee2 100644
--- a/keps/sig-node/3953-dynamic-node-resize/README.md
+++ b/keps/sig-node/3953-node-resource-hot-plug/README.md
@@ -1,4 +1,4 @@
-# KEP-3953: Node dynamic resize
+# KEP-3953: Node Resource Hot Plug
+### Notes/Constraints/Caveats (Optional)
### Risks and Mitigations
-
+1. Node resource hot plugging is an opt-in feature, merging the + feature related changes won't impact existing workloads.
Moreover, the feature + will be rolled out gradually, beginning with an alpha release for testing and + gathering feedback. This will be followed by beta and GA releases as the + feature matures and potential problems and improvements are addressed. +2. Though the node resource is updated dynamically, the dynamic data is fetched from cAdvisor and its well integrated with kubelet. +3. Resource manager are updated to adapt to the dynamic node reconfigurations, Enough tests should be added to make sure its not affecting the existing functionalities. ## Design Details @@ -170,17 +157,13 @@ The interaction sequence is as follows Note: In case of increase in cluster resources, the scheduler will automatically schedule any pending pods. -**Kubelet Configuration changes** - -* Add a variable to configure the interval to fetch the updated machine information. - **Proposed Code changes** -**Dynamic Node resize and Pod Re-admission logic** +**Dynamic Node Scale Up logic** ```go - if utilfeature.DefaultFeatureGate.Enabled(features.DynamicNodeResize) { - // Handle the node dynamic resize + if utilfeature.DefaultFeatureGate.Enabled(features.NodeResourceHotPlug) { + // Handle the node dynamic scale up machineInfo, err := kl.cadvisor.MachineInfo() if err != nil { klog.ErrorS(err, "Error fetching machine info") @@ -199,7 +182,7 @@ Note: In case of increase in cluster resources, the scheduler will automatically } ``` -**Changes to resource managers to adapt to dynamic resize** +**Changes to resource managers to adapt to dynamic scale up of resources** 1. 
Adding ResyncComponents() method to ContainerManager interface
```go
@@ -222,119 +205,42 @@ Note: In case of increase in cluster resources, the scheduler will automatically
Sync(machineInfo *cadvisorapi.MachineInfo) error
```
- -Note: PoC code changes: https://github.com/kubernetes/kubernetes/pull/115755 -
### Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
-##### Prerequisite testing updates - -
##### Unit tests
-1. Add necessary tests in kubelet_node_status_test.go to check for the node status behaviour with dynamic node resize.
+1. Add necessary tests in kubelet_node_status_test.go to check for the node status behaviour with dynamic node scale up.
2. Add necessary tests in kubelet_pods_test.go to check for the pod cleanup and pod addition workflow. 3. Add necessary tests in eventhandlers_test.go to check for scheduler behaviour with dynamic node capacity change. 4. Add necessary tests in resource managers to check for managers behaviour to adopt dynamic node capacity change.
-##### Integration tests - - -- :
##### e2e tests
-
Following scenarios need to be covered:
-- :
+* Node resource information before and after resource hot plug.
+* State of Pending pods due to lack of resources after resource hot plug.
+* Resource manager states after the resynch of components.
### Graduation Criteria
-
### Upgrade / Downgrade Strategy
@@ -408,7 +314,7 @@ well as the [existing list] of feature gates. -->
- [x] Feature gate (also fill in values in `kep.yaml`)
- - Feature gate name:DynamicNodeResize
+ - Feature gate name: NodeResourceHotPlug
- Components depending on the feature gate: kubelet
- [ ] Other - Describe the mechanism:
@@ -419,40 +325,26 @@ well as the [existing list] of feature gates.
###### Does enabling the feature change any default behavior?
-
+No. This feature is guarded by a feature gate. 
Existing default behavior does not change if the
+feature is not used.
+Even if the feature is enabled via the feature gate, if there is no change in
+node configuration the system will continue to work in the same way.
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
-
+Yes. Once disabled, any hot plug of resources won't be reflected at the cluster level without a kubelet restart.
###### What happens if we reenable the feature if it was previously rolled back?
-###### Are there any tests for feature enablement/disablement?
+If the feature is reenabled, node resources can be hot plugged again. The cluster will be automatically updated
+with the new resource information.
-
+Yes, the tests will be added along with the alpha implementation.
+* Validate that a hot plug of resources to the machine is reflected at the node resource level.
+* Validate that a hot plug of resources makes pending pods transition into the Running state.
+* Validate that the resource managers are updated with the latest machine information after a hot plug of resources.
### Rollout, Upgrade and Rollback Planning
@@ -472,6 +364,11 @@ rollout. Similarly, consider large clusters and how enablement/disablement will rollout across nodes. -->
+Rollout may fail if the resource managers are not re-synced properly due to programmatic errors.
+In case of rollout failures, running workloads are not affected; if pods are in the Pending state they remain
+in the Pending state.
+Rollback failure should not affect running workloads.
+
###### What specific metrics should inform a rollback?
+If pods remain pending after a hot plug of resources and there is still no change in the `scheduler_pending_pods` metric,
+the feature is not working as expected.
+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+It will be tested manually as part of the implementation, and there will also be automated tests to cover the scenarios. 
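For reference, enabling the gate on a node comes down to a KubeletConfiguration fragment like the following sketch; the gate name `NodeResourceHotPlug` is the alpha name proposed in this KEP and the fragment is illustrative, not prescriptive:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Opt in to node resource hot plug (alpha; default false).
  NodeResourceHotPlug: true
```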
+ ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
-
+No
### Monitoring Requirements
+This feature will be built into kubelet and behind a feature gate. Examining the kubelet feature gate would help
+in determining whether the feature is used. The enablement of the kubelet feature gate can be determined from the
+`kubernetes_feature_enabled` metric.
+
###### How can someone using this feature know that it is working for their instance?
-- [ ] Events - - Event Reason: -- [ ] API .status - - Condition name: - - Other field: -- [ ] Other (treat as last resort) - - Details:
+An end user can hot plug resources and verify that the change is reflected at the node resource level.
+In case there were any pending pods prior to the resource hot plug, those pods should transition into Running with the addition
+of new resources.
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -545,19 +447,16 @@ high level (needs more precise definitions) those may be things like: These goals will help you determine what you need to measure (SLIs) in the next question. -->
-
+No increase in the `scheduler_pending_pods` rate.
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
-- [ ] Metrics - - Metric name: - - [Optional] Aggregation method: - - Components exposing the metric: -- [ ] Other (treat as last resort) - - Details:
+- [X] Metrics
+ - Metric name: `scheduler_pending_pods`
+ - Components exposing the metric: scheduler
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -565,7 +464,7 @@ Pick one more of these and delete the rest. Describe the metrics themselves and the reasons why they weren't added (e.g., cost, implementation difficulties, etc.). 
--> - +No
### Dependencies
+No. It does not depend on any service running on the cluster, but it depends on the cAdvisor package to fetch
+the machine resource information.
### Scalability
@@ -616,6 +517,10 @@ Focusing mostly on: heartbeats, leader election, etc.) -->
+No. It won't add/modify any user-facing APIs.
+The resource managers might need to be updated with new methods to resync their components with updated
+machine information.
+
###### Will enabling / using this feature result in introducing new API types?
-
+No
###### Will enabling / using this feature result in any new calls to the cloud provider?
-
+No
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
-
+No
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
-
+Negligible. In the case of a resource hot plug, the resource managers may take some time to resync.
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
-
+Negligible computational overhead might be introduced into kubelet as it periodically needs to fetch machine information
+from the cAdvisor cache and resync all the resource managers with the updated machine information.
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+Yes, it could.
+Since the node's computational capacity is increased dynamically, there might be more pods scheduled on the node.
+This is, however, mitigated by the maxPods kubelet configuration that limits the number of pods on a node.
### Troubleshooting
@@ -692,6 +601,10 @@ details). For now, we leave it here.
###### How does this feature react if the API server and/or etcd is unavailable?
+This feature is node local and mainly handled in kubelet; it has no dependency on etcd. 
+In case there are pending pods and resources are hot plugged, the scheduler relies on the API server to fetch node information.
+Without access to the API server, it cannot make scheduling decisions as the node resources are not updated. The pending pods would remain in the same condition.
+
###### What are other known failure modes?
+This feature mainly does two things: fetch machine information from cAdvisor and reinitialize resource managers.
+Failure scenarios can occur at the cAdvisor level, i.e., if it is wrongly updated with incorrect machine information.
+
+
###### What steps should be taken if SLOs are not being met to determine the problem?
+If enabling this feature causes performance degradation, it is suggested not to hot plug resources and to restart the kubelet
+manually to continue operation as before.
+
## Implementation History
@@ -730,16 +650,10 @@ Why should this KEP _not_ be implemented?
## Alternatives
+The existing alternative to this effort would be restarting the kubelet manually each time after a node resize. 
+ - -## Infrastructure Needed (Optional) - -
diff --git a/keps/sig-node/3953-dynamic-node-resize/kep.yaml b/keps/sig-node/3953-node-resource-hot-plug/kep.yaml
similarity index 82%
rename from keps/sig-node/3953-dynamic-node-resize/kep.yaml
rename to keps/sig-node/3953-node-resource-hot-plug/kep.yaml
index ea02aca8620..232e4542e04 100644
--- a/keps/sig-node/3953-dynamic-node-resize/kep.yaml
+++ b/keps/sig-node/3953-node-resource-hot-plug/kep.yaml
@@ -1,4 +1,4 @@
-title: Dynamic node resize
+title: Node Resource Hot Plug
kep-number: 3953
authors:
- "@Karthik-K-N"
@@ -13,13 +13,14 @@ reviewers:
- "@smarterclayton"
- "@ffromani"
- "@SergeyKanzhelev"
+ - "@haircommander"
approvers:
- "@sig-node-leads"
see-also:
stage: "alpha"
-latest-milestone: "v1.32"
+latest-milestone: "v1.33"
milestone:
alpha: ""
From 0af00c8ceea02f4441205b0c6f6fb988497c4b38 Mon Sep 17 00:00:00 2001
From: Karthik Bhat
Date: Thu, 30 Jan 2025 20:50:27 +0530
Subject: [PATCH 05/19] Address review comments
--- keps/prod-readiness/sig-node/3953.yaml | 3 +++ .../3953-node-resource-hot-plug/README.md | 19 +++++++++++++++++-- .../3953-node-resource-hot-plug/kep.yaml | 18 ++++++++++++++++++ 3 files changed, 38 insertions(+), 2 deletions(-) create mode 100644 keps/prod-readiness/sig-node/3953.yaml
diff --git a/keps/prod-readiness/sig-node/3953.yaml b/keps/prod-readiness/sig-node/3953.yaml
new file mode 100644
index 00000000000..cc389c5e866
--- /dev/null
+++ b/keps/prod-readiness/sig-node/3953.yaml
@@ -0,0 +1,3 @@
+kep-number: 3953
+alpha:
+ approver: "@deads2k"
\ No newline at end of file
diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md
index 44e2fc07ee2..dfb06debe9d 100644
--- a/keps/sig-node/3953-node-resource-hot-plug/README.md
+++ b/keps/sig-node/3953-node-resource-hot-plug/README.md
@@ -17,8 +17,9 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Goals](#goals) - [Non-Goals](#non-goals) - [Proposal](#proposal)
- - [User Stories (Optional)](#user-stories-optional)
+ - [User Stories](#user-stories)
- [Story 1](#story-1)
+ - [Story 2](#story-2)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details)
@@ -26,7 +27,10 @@ tags, and then generate with `hack/update-toc.sh`.
- [Unit tests](#unit-tests) - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria)
+ - [Phase 1: Alpha (target 1.33)](#phase-1-alpha-target-133)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+ - [Upgrade](#upgrade)
+ - [Downgrade](#downgrade)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
@@ -38,7 +42,6 @@ tags, and then generate with `hack/update-toc.sh`.
- [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives)
-- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
## Release Signoff Checklist
@@ -256,6 +259,16 @@ enhancement: cluster required to make on upgrade, in order to make use of the enhancement? -->
+##### Upgrade
+
+To upgrade the cluster to use this feature, the kubelet should be updated with the feature gate enabled.
+Existing clusters are not impacted, as the node resources have already been updated during cluster creation.
+
+##### Downgrade
+
+It's always possible to trivially downgrade to the previous kubelet. It does not have any impact, as future node resource hot plugs won't be reflected in the cluster
+without a manual kubelet restart.
+
### Version Skew Strategy
+Not relevant, as this is a kubelet-specific feature and does not impact other components. 
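The upgrade/downgrade behaviour described above hinges on the poll → compare → resync loop sketched in the Design Details. The following self-contained Go sketch illustrates that loop under simplifying assumptions: the local `machineInfo`, `resyncer`, and `pollOnce` names stand in for cAdvisor's `MachineInfo` and the kubelet's internals, and are not the real API.

```go
package main

import (
	"fmt"
	"reflect"
)

// machineInfo is a stand-in for cAdvisor's MachineInfo in this sketch.
type machineInfo struct {
	NumCores       int
	MemoryCapacity uint64
}

// resyncer mimics the proposed ResyncComponents hook on the container manager;
// it records every machine info it was asked to resync with.
type resyncer struct{ synced []machineInfo }

func (r *resyncer) ResyncComponents(mi machineInfo) error {
	r.synced = append(r.synced, mi)
	return nil
}

// pollOnce compares freshly fetched info with the cached copy and triggers a
// resync only when something changed, as the KEP describes. It reports whether
// a change was observed.
func pollOnce(cached *machineInfo, fetch func() machineInfo, r *resyncer) bool {
	fresh := fetch()
	if reflect.DeepEqual(*cached, fresh) {
		return false // nothing changed; skip node status update and resync
	}
	*cached = fresh
	r.ResyncComponents(fresh) // error handling elided in this sketch
	return true
}

func main() {
	cached := machineInfo{NumCores: 4, MemoryCapacity: 8 << 30}
	r := &resyncer{}
	// First poll: the simulated fetch returns identical info, so no resync.
	fmt.Println(pollOnce(&cached, func() machineInfo { return machineInfo{NumCores: 4, MemoryCapacity: 8 << 30} }, r))
	// Second poll: four CPUs were hot plugged (4 -> 8), so the cache updates.
	fmt.Println(pollOnce(&cached, func() machineInfo { return machineInfo{NumCores: 8, MemoryCapacity: 8 << 30} }, r))
	fmt.Println(cached.NumCores)
}
```

Because an unchanged poll is a no-op, disabling the feature (or downgrading the kubelet) simply stops the loop; nothing needs to be undone.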
+ ## Production Readiness Review Questionnaire ## Release Signoff Checklist @@ -74,28 +76,31 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -The proposal aims at enabling hot plugging of node compute resources. This will help in updating cluster resource capacity by just resizing compute resources of nodes rather than adding new node to a cluster. -The updated node configurations are to be reflected at the node and cluster levels automatically without the need to reset the kubelet. - -This proposal also aims to improve the initialization and reinitialization of resource managers, such as the CPU manager and memory manager, in response to changes in a node's CPU and memory configurations. - +The proposal seeks to facilitate hot plugging of node compute resources, thereby streamlining cluster resource capacity updates through node compute resource resizing, rather than introducing new nodes to the cluster. +The revised node configurations will be automatically propagated at both the node and cluster levels, eliminating the necessity for a kubelet reset. +Furthermore, this proposal intends to enhance the initialization and reinitialization processes of resource managers, including the CPU manager and memory manager, in response to alterations in a node's CPU and memory configurations. +This approach aims to optimize resource management, improve scalability, and minimize disruptions to cluster operations. ## Motivation -In a typical Kubernetes environment, the cluster resources may need to be altered due to following reasons: -- Incorrect resource assignment during cluster creation. -- Increased workload over time, leading to the need for additional resources in the cluster. -To handle these scenarios, we can: -- Horizontally scale up the cluster by adding compute nodes. -- Vertically scale up the cluster by increasing node capacity. 
However, currently, the workaround for capturing node resizing in the cluster involves restarting the Kubelet. +In a conventional Kubernetes environment, the cluster resources might necessitate modification due to the following factors: +- Inaccurate resource allocation during cluster initialization. +- Escalating workload over time, necessitating supplementary resources within the cluster. -Node resource hot plugging will provide advantages in scenarios such as: -- Handling resource demand with a limited set of nodes by increasing the capacity of existing nodes instead of creating new nodes. -- Creating new nodes takes more time compared to increasing the capacity of existing nodes. +To address these situations, we can: +- Horizontally scale the cluster by incorporating additional compute nodes. +- Vertically scale the cluster by augmenting node capacity. Currently, the method to capture node resizing within the cluster entails restarting the Kubelet. +These strategies enable the cluster to adapt to varying resource demands, ensuring optimal performance and efficient resource utilization. However, the limitation of requiring a Kubelet restart for node resizing is an area for potential improvement. + +Node resource hot plugging offers benefits in situations like: +- Managing resource demand with a restricted number of nodes by enhancing the capacity of current nodes rather than creating new ones. +- The process of creating new nodes is more time-consuming compared to augmenting the capacity of existing nodes. + +This approach allows for more efficient resource management and quicker capacity adjustments, optimizing the utilization of existing hardware. ### Goals -* Dynamically scale up the node by hot plugging resources and without restarting the kubelet. -* Ability to reinitialize resource managers (CPU manager, memory manager) to adopt changes in node's resource. 
+* Achieve seamless node capacity expansion through hot plugging resources, all without necessitating a kubelet restart. +* Facilitate the reinitialization of resource managers (CPU manager, memory manager) to accommodate alterations in the node's resource allocation. ### Non-Goals @@ -103,12 +108,12 @@ Node resource hot plugging will provide advantages in scenarios such as: * Hot unplug of node resources. * Update the autoscaler to utilize resource hot plugging. - ## Proposal -This KEP aims to support the node resource hot plugging by adding a polling mechanism in kubelet to fetch the machine-information from cAdvisor's cache which is already updated periodically, This information will be fetched periodically by kubelet, after which the node status updater is responsible for updating this information at node level in the cluster. -Additionally, this KEP aims to improve the initialization and reinitialization of resource managers, such as the memory manager and CPU manager, so that they can adapt to change in node's configurations. +This KEP strives to enable node resource hot plugging by incorporating a polling mechanism within the kubelet to retrieve machine-information from cAdvisor's cache, which is already updated periodically. +The kubelet will periodically fetch this information, subsequently entrusting the node status updater to disseminate these updates at the node level across the cluster. +Moreover, this KEP aims to refine the initialization and reinitialization processes of resource managers, including the memory manager and CPU manager, to ensure their adaptability to changes in node configurations. ### User Stories @@ -118,7 +123,7 @@ As a cluster admin, I must be able to increase the cluster resource capacity wit #### Story 2 -As a cluster admin, I must be able to increase the cluster resource capacity without need to restarting the kubelet. 
+As a cluster admin, I must be able to increase the cluster resource capacity without need to restart the kubelet.
### Notes/Constraints/Caveats (Optional)
@@ -158,7 +163,20 @@ The interaction sequence is as follows
3. Node status updater will update the node's status with the latest resource information. 4. Kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes.
-Note: In case of increase in cluster resources, the scheduler will automatically schedule any pending pods.
+With an increase in cluster resources, the following components will be updated:
+
+1. Scheduler
+ * Scheduler will automatically schedule any pending pods.
+
+
+2. Change in Swap Memory limit
+ * Currently, the swap memory limit is calculated as
+ `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable`
+ * So an increase in nodeTotalMemory will result in an updated swap memory limit.
+
+
+3. Change in OOM score
+ * OOM score calculation depends on the machine's memory, so the new OOM score will be updated accordingly.
**Proposed Code changes**
@@ -172,7 +190,9 @@ Note: In case of increase in cluster resources, the scheduler will automatically
klog.ErrorS(err, "Error fetching machine info")
} else {
cachedMachineInfo, _ := kl.GetCachedMachineInfo()
-
+ // Avoid collector collects it as a timestamped metric
+ // See PR #95210 and #97006 for more details.
+ machineInfo.Timestamp = time.Time{}
if !reflect.DeepEqual(cachedMachineInfo, machineInfo) {
kl.setCachedMachineInfo(machineInfo)
@@ -204,8 +224,8 @@ Note: In case of increase in cluster resources, the scheduler will automatically
2. Adding a method Sync to all the resource managers and will be invoked once there is dynamic resource change. 
```go
- // Sync will sync the CPU Manager with the latest machine info
- Sync(machineInfo *cadvisorapi.MachineInfo) error
+ // SyncMachineInfo will sync the Manager with the latest machine info
+ SyncMachineInfo(machineInfo *cadvisorapi.MachineInfo) error
```
### Test Plan
@@ -228,7 +248,7 @@ Following scenarios need to be covered:
* Node resource information before and after resource hot plug. * State of Pending pods due to lack of resources after resource hot plug.
-* Resource manager states after the resynch of components.
+* Resource manager states after the resync of components.
### Graduation Criteria
@@ -269,6 +289,7 @@ Existing cluster does not have any impact as the node resources already been upd
It's always possible to trivially downgrade to the previous kubelet. It does not have any impact, as future node resource hot plugs won't be reflected in the cluster without a manual kubelet restart.
+
### Version Skew Strategy
+Hot unplug of resources is not supported, so any decrease in node resources will be automatically updated; however, pod
+re-admission is not done, so pods may be running with low resources until the kubelet is restarted.
## Alternatives
@@ -672,3 +695,10 @@ What other approaches did you consider, and why did you rule them out? These do not need to be as detailed as the proposal, but should include enough information to express the idea and why it was not acceptable. -->
+
+## Infrastructure Needed (Optional)
+The cluster's VMs should support hot plugging of compute resources for e2e tests.
+
+## Future Work
+
+* Support hot-unplug of node resources: hot-unplug requires pod re-admission; a separate KEP is planned to support this feature. 
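The `SyncMachineInfo` contract above can be illustrated with a toy manager that grows its view of allocatable CPUs when the core count increases. This is a sketch only: `machineInfo` and `toyCPUManager` are local stand-ins for `cadvisorapi.MachineInfo` and the real CPU manager, and the hot-unplug rejection mirrors the KEP's scale-up-only scope rather than actual kubelet behaviour.

```go
package main

import "fmt"

// machineInfo stands in for cadvisorapi.MachineInfo in this sketch.
type machineInfo struct{ NumCores int }

// toyCPUManager keeps a per-core allocation table that must track the
// machine's core count, mirroring the proposed SyncMachineInfo contract.
type toyCPUManager struct {
	allocatable []int // logical CPU IDs the manager may hand out
}

// SyncMachineInfo resyncs the manager with the latest machine info.
// Only growth is handled, matching the KEP's hot-plug-only (no unplug) scope.
func (m *toyCPUManager) SyncMachineInfo(mi machineInfo) error {
	if mi.NumCores < len(m.allocatable) {
		return fmt.Errorf("hot unplug not supported: %d < %d", mi.NumCores, len(m.allocatable))
	}
	for id := len(m.allocatable); id < mi.NumCores; id++ {
		m.allocatable = append(m.allocatable, id) // bring the new core into the pool
	}
	return nil
}

func main() {
	m := &toyCPUManager{allocatable: []int{0, 1, 2, 3}}
	if err := m.SyncMachineInfo(machineInfo{NumCores: 6}); err != nil {
		panic(err)
	}
	fmt.Println(len(m.allocatable)) // 6 allocatable CPUs after hot plugging two cores
}
```

A failed resync here surfaces as an error rather than silently corrupting the allocation table, which is the property the unit tests listed above would need to pin down.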
diff --git a/keps/sig-node/3953-node-resource-hot-plug/kep.yaml b/keps/sig-node/3953-node-resource-hot-plug/kep.yaml index 618799577cd..9cbf5d26c62 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/kep.yaml +++ b/keps/sig-node/3953-node-resource-hot-plug/kep.yaml @@ -14,8 +14,10 @@ reviewers: - "@ffromani" - "@SergeyKanzhelev" - "@haircommander" + - "@tallclair" approvers: - - "@sig-node-leads" + - "@haircommander" + - TBD see-also: replaces: From 4375027045d517a3f964b94a5ef733e9802681cb Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Wed, 5 Feb 2025 15:24:04 +0530 Subject: [PATCH 07/19] Address review comments --- .../3953-node-resource-hot-plug/README.md | 104 +++++++++++++----- .../3953-node-resource-hot-plug/kep.yaml | 3 +- 2 files changed, 76 insertions(+), 31 deletions(-) diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md index 1dc0f9b0969..7c7090413f8 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/README.md +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -20,6 +20,7 @@ tags, and then generate with `hack/update-toc.sh`. - [User Stories](#user-stories) - [Story 1](#story-1) - [Story 2](#story-2) + - [Story 3](#story-3) - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) @@ -80,7 +81,13 @@ The proposal seeks to facilitate hot plugging of node compute resources, thereby The revised node configurations will be automatically propagated at both the node and cluster levels, eliminating the necessity for a kubelet reset. Furthermore, this proposal intends to enhance the initialization and reinitialization processes of resource managers, including the CPU manager and memory manager, in response to alterations in a node's CPU and memory configurations. 
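The kernel-level hot plug that this summary relies on is exposed through sysfs on Linux. A sketch of how a resize surfaces on the node (paths per the kernel CPU/memory hotplug documentation; the write operations are shown only as comments because they require root and real hot-pluggable hardware):

```shell
# Read-only: which logical CPUs the kernel currently has online.
online="$(cat /sys/devices/system/cpu/online)"
echo "online CPUs: ${online}"

# After the hypervisor hot plugs a vCPU, it typically appears offline until enabled:
#   echo 1 > /sys/devices/system/cpu/cpu4/online        # requires root
# Similarly for a hot-plugged memory block (N is the block index):
#   echo online > /sys/devices/system/memory/memoryN/online
```

Once the kernel reports the new capacity, cAdvisor's periodic machine-info refresh picks it up, which is the point where this KEP's kubelet polling takes over.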
This approach aims to optimize resource management, improve scalability, and minimize disruptions to cluster operations. + ## Motivation +Currently, the node's configurations are recorded solely during the kubelet bootstrap phase and subsequently cached. assuming the node's compute capacity remains unchanged throughout the cluster's lifecycle. + +However, contemporary kernel capabilities enable the dynamic addition of CPUs and memory to a node (References: https://docs.kernel.org/core-api/cpu_hotplug.html and https://docs.kernel.org/core-api/memory-hotplug.html). +This can result in Kubernetes being unaware of the node's altered compute capacities during a live-resize, causing the node to retain outdated information. +This can lead to inconsistencies or an imbalance in the cluster, affecting the optimal scheduling and deployment of workloads. In a conventional Kubernetes environment, the cluster resources might necessitate modification due to the following factors: - Inaccurate resource allocation during cluster initialization. @@ -90,13 +97,16 @@ To address these situations, we can: - Horizontally scale the cluster by incorporating additional compute nodes. - Vertically scale the cluster by augmenting node capacity. Currently, the method to capture node resizing within the cluster entails restarting the Kubelet. -These strategies enable the cluster to adapt to varying resource demands, ensuring optimal performance and efficient resource utilization. However, the limitation of requiring a Kubelet restart for node resizing is an area for potential improvement. +These strategies enable the cluster to adapt to varying resource demands, ensuring optimal performance and efficient resource utilization. +However, the limitation of requiring a Kubelet restart for node resizing is an area for potential improvement. 
+ +Node resource hot plugging proves advantageous in scenarios such as: +- Efficiently managing resource demands with a limited number of nodes by increasing the capacity of existing nodes instead of provisioning new ones. +- The procedure of establishing new nodes is considerably more time-intensive than expanding the capabilities of current nodes. + +Implementing this KEP will empower nodes to recognize and adapt to changes in their configurations, +thereby facilitating the efficient and effective deployment of pod workloads to nodes capable of meeting the required compute demands. -Node resource hot plugging offers benefits in situations like: -- Managing resource demand with a restricted number of nodes by enhancing the capacity of current nodes rather than creating new ones. -- The process of creating new nodes is more time-consuming compared to augmenting the capacity of existing nodes. - -This approach allows for more efficient resource management and quicker capacity adjustments, optimizing the utilization of existing hardware. ### Goals * Achieve seamless node capacity expansion through hot plugging resources, all without necessitating a kubelet restart. @@ -107,10 +117,11 @@ This approach allows for more efficient resource management and quicker capacity * Dynamically adjust system reserved and kube reserved values. * Hot unplug of node resources. * Update the autoscaler to utilize resource hot plugging. +* Re-balance workloads across the nodes. +* Update runtime/NRI plugins with host resource changes. ## Proposal - This KEP strives to enable node resource hot plugging by incorporating a polling mechanism within the kubelet to retrieve machine-information from cAdvisor's cache, which is already updated periodically. The kubelet will periodically fetch this information, subsequently entrusting the node status updater to disseminate these updates at the node level across the cluster. 
Moreover, this KEP aims to refine the initialization and reinitialization processes of resource managers, including the memory manager and CPU manager, to ensure their adaptability to changes in node configurations. @@ -119,24 +130,34 @@ Moreover, this KEP aims to refine the initialization and reinitialization proces #### Story 1 -As a cluster admin, I must be able to increase the cluster resource capacity without adding a new node to the cluster. +Pinning of workloads to nodes with certain hardware capabilities with limited CPU and memory configurations. + Adopting this KEP will allow nodes with certain hardware capabilities to be resized to accommodate additional workloads that are dependent on particular hardware capability. #### Story 2 +As a cluster admin, I must be able to increase the cluster resource capacity without adding a new node to the cluster. + +#### Story 3 + As a cluster admin, I must be able to increase the cluster resource capacity without need to restart the kubelet. ### Notes/Constraints/Caveats (Optional) ### Risks and Mitigations -1. Node resource hot plugging is an opt-in feature, merging the - feature related changes won't impact existing workloads. Moreover, the feature - will be rolled out gradually, beginning with an alpha release for testing and - gathering feedback. This will be followed by beta and GA releases as the - feature matures and potential problems and improvements are addressed. -2. Though the node resource is updated dynamically, the dynamic data is fetched from cAdvisor and its well integrated with kubelet. -3. Resource manager are updated to adapt to the dynamic node reconfigurations, Enough tests should be added to make sure its not affecting the existing functionalities. 
- +- Change in OOMScoreAdjust value: + - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity` + - However, with the change in memoryCapacity post up-scale, the OOMScoreAdjust of pods deployed post up-scale may not be in line with the + precalculated scores of pods which were deployed before. +- Change in Swap limit: + - The formula to calculate the swap limit is `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable` + - However, with the change in nodeTotalMemory and totalPodsSwapAvailable post up-scale, the swap limit of pods deployed post up-scale may not be in line with the + precalculated limits of pods which were deployed before. +- Post up-scale, any failure in the resync of resource managers may lead to incorrect or rejected allocation, which can lead to underperforming or rejected workloads. +- Lack of coordination about change in resource availability across kubelet/runtime/plugins. + +- To mitigate the risks, adequate tests should be added to avoid the scenarios where failure to resync resource managers can occur. +- The plugins/runtime should be updated to react to change in resource information on the node. ## Design Details @@ -168,15 +189,15 @@ With increase in cluster resources the following components will updated 1. Scheduler * Scheduler will automatically schedule any pending pods. +2. Change in OOM score adjust + * Currently, the OOM score adjust is calculated by + `1000 - (1000*containerMemReq)/memoryCapacity` + * So increase in memoryCapacity will result in updated OOM score adjust for pods deployed post resize. -2. Change in Swap Memory limit - * Currently, the swap memory limit is calculated as +3. Change in Swap Memory limit + * Currently, the swap memory limit is calculated by `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable` - * So increase in nodeTotalMemory will result in updated swap memory limit. - - -3. Change in OOM score - * OOM score calculation depends on machine's memory, so the new OOM score will be updated accordingly.
+ * So increase in nodeTotalMemory will result in updated swap memory limit for pods deployed post resize. **Proposed Code changes** @@ -246,7 +267,10 @@ to implement this enhancement. Following scenarios need to be covered: -* Node resource information before and after resource hot plug. +* Node resource information before and after resource hot plug for the following scenarios: + * upsize -> downsize + * upsize -> downsize -> upsize + * downsize -> upsize * State of Pending pods due to lack of resources after resource hot plug. * Resource manager states after the resync of components. @@ -411,7 +435,7 @@ Rollback failure should not affect running workloads. What signals should users be paying attention to when the feature is young that might indicate a serious problem? --> - +A significant increase in the `node_resize_resync_errors_total` metric indicates that the feature is not working as expected. Likewise, if pods remain pending after a resource hot plug and the `scheduler_pending_pods` metric does not decrease, the feature is not working as expected. @@ -440,6 +464,10 @@ For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field. --> +Monitor the following metrics: +- `node_resize_resync_request_total` +- `node_resize_resync_errors_total` + ###### How can an operator determine if the feature is in use by workloads? -No increase in the `scheduler_pending_pods` rate. + +For each node, the value of the metric `node_resize_resync_request_total` is expected to match the number of times the node has been resized. +For each node, the value of the metric `node_resize_resync_errors_total` is expected to be zero. + + ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- [X] Metrics - - Metric name: `scheduler_pending_pods` - - Components exposing the metric: scheduler + - Metric name: + - `node_resize_resync_request_total` + - `node_resize_resync_errors_total` + - Components exposing the metric: kubelet ###### Are there any missing metrics that would be useful to have to improve observability of this feature? @@ -500,7 +537,9 @@ Pick one more of these and delete the rest. Describe the metrics themselves and the reasons why they weren't added (e.g., cost, implementation difficulties, etc.). --> -No +- `node_resize_resync_request_total` +- `node_resize_resync_errors_total` + ### Dependencies - [Release Signoff Checklist](#release-signoff-checklist) +- [Glossary](#glossary) - [Summary](#summary) - [Motivation](#motivation) - [Goals](#goals) @@ -21,6 +22,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Story 1](#story-1) - [Story 2](#story-2) - [Story 3](#story-3) + - [Story 4](#story-4) - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) @@ -75,10 +77,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release* [kubernetes/kubernetes]: https://git.k8s.io/kubernetes [kubernetes/website]: https://git.k8s.io/website +## Glossary + +hotplug: dynamically add compute resources (CPU, memory) to the node, either via software (bringing offlined resources online) or via hardware (physical additions while the system is running) + +hotunplug: dynamically remove compute resources (CPU, memory) from the node, either via software (making resources go offline) or via hardware (physical removal while the system is running) + + ## Summary The proposal seeks to facilitate hot plugging of node compute resources, thereby streamlining cluster resource capacity updates through node compute resource resizing, rather than introducing new nodes to the cluster.
-The revised node configurations will be automatically propagated at both the node and cluster levels, eliminating the necessity for a kubelet reset. +The revised node configurations will be automatically propagated at both the node and cluster levels. Furthermore, this proposal intends to enhance the initialization and reinitialization processes of resource managers, including the CPU manager and memory manager, in response to alterations in a node's CPU and memory configurations. This approach aims to optimize resource management, improve scalability, and minimize disruptions to cluster operations. @@ -95,21 +104,23 @@ In a conventional Kubernetes environment, the cluster resources might necessitat To address these situations, we can: - Horizontally scale the cluster by incorporating additional compute nodes. -- Vertically scale the cluster by augmenting node capacity. Currently, the method to capture node resizing within the cluster entails restarting the Kubelet. +- Vertically scale the cluster by augmenting node capacity. As a workaround for this issue, the method to capture node resizing within the cluster entails restarting the Kubelet. -These strategies enable the cluster to adapt to varying resource demands, ensuring optimal performance and efficient resource utilization. -However, the limitation of requiring a Kubelet restart for node resizing is an area for potential improvement. +These strategies enable the cluster to adapt to varying resource demands, ensuring optimal performance and efficient resource utilization. +However, for vertical scaling, the current implementation does not allow the kubelet to be aware of the changes made to the compute capacity of the node. Node resource hot plugging proves advantageous in scenarios such as: - Efficiently managing resource demands with a limited number of nodes by increasing the capacity of existing nodes instead of provisioning new ones.
- The procedure of establishing new nodes is considerably more time-intensive than expanding the capabilities of current nodes. +- Reduced inter-pod network latencies, as inter-node traffic can be reduced when more pods can be hosted on a single node. +- Easier to manage the cluster with fewer nodes, which brings less overhead on the control plane. Implementing this KEP will empower nodes to recognize and adapt to changes in their configurations, thereby facilitating the efficient and effective deployment of pod workloads to nodes capable of meeting the required compute demands. ### Goals -* Achieve seamless node capacity expansion through hot plugging resources, all without necessitating a kubelet restart. +* Achieve seamless node capacity expansion through hot plugging resources. * Facilitate the reinitialization of resource managers (CPU manager, memory manager) to accommodate alterations in the node's resource allocation. ### Non-Goals @@ -130,16 +141,25 @@ Moreover, this KEP aims to refine the initialization and reinitialization proces #### Story 1 -Pinning of workloads to nodes with certain hardware capabilities with limited CPU and memory configurations. - Adopting this KEP will allow nodes with certain hardware capabilities to be resized to accommodate additional workloads that are dependent on particular hardware capability. +As a Kubernetes user, I want to resize nodes with existing specialized hardware (such as GPUs, FPGAs, TPUs, etc.) or CPU Capabilities (for example: https://www.kernel.org/doc/html/v5.8/arm64/elf_hwcaps.html) +to allocate more resources (CPU, memory) so that additional workloads, which depend on this hardware, can be efficiently scheduled and run without manual intervention. #### Story 2 -As a cluster admin, I must be able to increase the cluster resource capacity without adding a new node to the cluster.
+As a Kubernetes Application Developer, I want the kernel to optimize system performance by making better use of local resources when a node is resized, so that my applications run faster with fewer disruptions. This is achieved through: +Fewer context switches: With more CPU cores and memory on a resized node, the kernel can spread workloads out more efficiently. This reduces contention between processes, leading to fewer context switches (which can be costly in terms of CPU time), less process interference, and lower latency. +Better memory allocation: If the kernel has more memory available, it can allocate larger contiguous memory blocks, which can lead to better memory locality (i.e., keeping related data closer in physical memory), +reducing latency for applications that rely on large datasets, as in the case of database applications. #### Story 3 -As a cluster admin, I must be able to increase the cluster resource capacity without need to restart the kubelet. +As a Site Reliability Engineer (SRE), I want to reduce the operational complexity of managing multiple worker nodes, so that I can focus on fewer resources and simplify troubleshooting and monitoring. + +#### Story 4 + +As a Cluster administrator, I want to resize a Kubernetes node dynamically, so that I can quickly hot plug resources without waiting for new nodes to join the cluster. + ### Notes/Constraints/Caveats (Optional) @@ -310,8 +330,7 @@ Existing cluster does not have any impact as the node resources already been upd ##### Downgrade -It's always possible to trivially downgrade to the previous kubelet, It does not have any impact as the future node resource hot plug wont be reflected in cluster -without manual kubelet restart. +It's always possible to trivially downgrade to the previous kubelet. There is no impact, as future node resource hot plugs won't be reflected in the cluster.
### Version Skew Strategy @@ -392,7 +411,7 @@ node configuration the system will continue to work in the same way. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Yes. Once disabled any hot plug of resources won't reflect at the cluster level without kubelet restart. +Yes. Once disabled, any hot plug of resources won't reflect at the cluster level. ###### What happens if we reenable the feature if it was previously rolled back? @@ -424,7 +443,7 @@ rollout. Similarly, consider large clusters and how enablement/disablement will rollout across nodes. --> -Rollout may fail if the resource managers are not re-synced properly due to programatic errors. +Rollout may fail if the resource managers are not re-synced properly due to programmatic errors. In case of rollout failures, running workloads are not affected, If the pods are on pending state they remain in the pending state only. Rollback failure should not affect running workloads. @@ -700,9 +719,8 @@ Failure scenarios can occur in cAdvisor level that is if it wrongly updated with ###### What steps should be taken if SLOs are not being met to determine the problem? -If enabling this feature causes performance degradation, its suggested not to hot plug resources and restart the kubelet -to manually to continue operation as before. +If the SLOs are not being met, one can examine the kubelet logs; it is also advised not to hotplug node resources. ## Implementation History @@ -722,12 +740,13 @@ Major milestones might include: -Hot Unplug of resource is not supported so any decrease in node resources will be automatically updated but the Pods -re-admission is not done so Pods may be running with low resources until kubelet is restarted. + +Currently, this KEP only focuses on resource hotplug; however, in a case where the node is downsized, it is possible that the +node's capacity may be lower than the existing workloads' memory requirements.
## Alternatives -Existing and the alternative to this effort would be restarting the kubelet manually each time after the node resize. +Horizontally scale the cluster by incorporating additional compute nodes. Rollout may fail if the resource managers are not re-synced properly due to programmatic errors. -In case of rollout failures, running workloads are not affected, If the pods are on pending state they remain -in the pending state only. +In case of rollout failures, running workloads are not affected. If the pods are in pending state, they remain pending. Rollback failure should not affect running workloads. ###### What specific metrics should inform a rollback? @@ -915,7 +913,7 @@ VMs of cluster should support hot plug of compute resources for e2e tests. or if it has to be terminated due to resource crunch. * Recalculate OOM adjust score and Swap limits: * Since the total capacity of the node has changed, values associated with the nodes memory capacity must be recomputed. - * Handling unplug of reserved CPUs. + * Handling unplug of reserved and exclusively allocated cpus CPUs. * Fetching machine info via CRI * At present, the machine data is retrieved from cAdvisor's cache through periodic checks. There is ongoing development to utilize CRI APIs for this purpose.
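The "Recalculate OOM adjust score and Swap limits" item above can be made concrete with the two formulas this KEP quotes. A minimal sketch (units in MiB; the swap-limit variable names are the placeholders used elsewhere in this KEP, and the clamping the kubelet applies around these values is omitted):

```go
package main

import "fmt"

// oomScoreAdj follows the formula quoted in this KEP:
// 1000 - (1000*containerMemReq)/memoryCapacity, using integer arithmetic.
func oomScoreAdj(containerMemReq, memoryCapacity int64) int64 {
	return 1000 - (1000*containerMemReq)/memoryCapacity
}

// swapLimit follows (containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable,
// multiplying first so integer division does not truncate to zero.
func swapLimit(containerMemoryRequest, nodeTotalMemory, totalPodsSwapAvailable int64) int64 {
	return (containerMemoryRequest * totalPodsSwapAvailable) / nodeTotalMemory
}

func main() {
	const req = 2048 // MiB requested by a container
	// A pod admitted on an 8 GiB node, before and after memory is hot plugged to 16 GiB:
	fmt.Println(oomScoreAdj(req, 8192), oomScoreAdj(req, 16384))         // 750 875
	fmt.Println(swapLimit(req, 8192, 4096), swapLimit(req, 16384, 4096)) // 1024 512
	// The gap between the stale and recomputed values is what the
	// "recalculate" step has to reconcile for already-running pods.
}
```

Both formulas depend on the node's total memory, which is exactly why a capacity change leaves already-computed per-container values stale.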
From 9db91aec70d5fe9f9d82ce812baf2ab76375a04f Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Wed, 14 May 2025 12:13:19 +0530 Subject: [PATCH 16/19] Address review comments --- keps/sig-node/3953-node-resource-hot-plug/README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md index 0d186e69150..7c9b89f5aa5 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/README.md +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -88,10 +88,11 @@ Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugeP Hotunplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) from the node, either via software (make resources go offline) or via hardware (physical removal while the system is running) +Node Compute Resource: CPU, Memory, Swap Capacity and HugePages ## Summary -The proposal seeks to facilitate hot plugging of node compute resources(CPU, Memory, Swap Capacity and HugePages), thereby streamlining cluster resource capacity updates through node compute resource resizing rather than introducing new nodes to the cluster. +The proposal seeks to facilitate hot plugging of node compute resources, thereby streamlining cluster resource capacity updates through node compute resource resizing rather than introducing new nodes to the cluster. The revised node configurations will be automatically propagated at both the node and cluster levels. Furthermore, this proposal intends to enhance the initialization and reinitialization processes of resource managers, including the CPU manager and memory manager, in response to alterations in a node's CPU and memory configurations and @@ -135,7 +136,7 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th ### Goals * Achieve seamless node capacity expansion through hot plugging resources.
-* Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager to accommodate alterations in the node's resource allocation. +* Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager without reset to accommodate alterations in the node's resource allocation. * Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods. ### Non-Goals @@ -913,7 +914,7 @@ VMs of cluster should support hot plug of compute resources for e2e tests. or if it has to be terminated due to resource crunch. * Recalculate OOM adjust score and Swap limits: * Since the total capacity of the node has changed, values associated with the nodes memory capacity must be recomputed. - * Handling unplug of reserved and exclusively allocated cpus CPUs. + * Handling unplug of reserved and exclusively allocated CPUs. * Fetching machine info via CRI * At present, the machine data is retrieved from cAdvisor's cache through periodic checks. There is ongoing development to utilize CRI APIs for this purpose. From 579af1b143c5b779bcffdd1b46a177b28ea23640 Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Fri, 16 May 2025 18:49:32 +0530 Subject: [PATCH 17/19] Add CA compatability section --- .../3953-node-resource-hot-plug/README.md | 24 +++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md index 7c9b89f5aa5..07decb3d47d 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/README.md +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -29,6 +29,7 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Design Details](#design-details) - [Handling hotplug events](#handling-hotplug-events) - [Flow Control for updating swap limit for containers](#flow-control-for-updating-swap-limit-for-containers) + - [Compatability with Cluster Autoscaler](#compatability-with-cluster-autoscaler) - [Handling HotUnplug Events](#handling-hotunplug-events) - [Flow Control](#flow-control) - [Test Plan](#test-plan) @@ -135,7 +136,7 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th ### Goals -* Achieve seamless node capacity expansion through hot plugging resources. +* Achieve seamless node capacity expansion through resource hotplug. * Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager without reset to accommodate alterations in the node's resource allocation. * Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods. @@ -143,7 +144,7 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th * Dynamically adjust system reserved and kube reserved values. * Hot unplug of node resources. -* Update the autoscaler to utilize resource hot plugging. +* Update the autoscaler to utilize resource hotplug. * Re-balance workloads across the nodes. * Update runtime/NRI plugins with host resource changes. @@ -278,9 +279,9 @@ With increase in cluster resources the following components will be updated: Once the capacity of the node is altered, the following are the sequence of events that occur in the kubelet. If any errors are observed in any of the steps, operation is retried from step 1 along with a `FailedNodeResize` event under the node object. 1. Resizing existing containers: - a.With the increased memory capacity of the nodes, the kubelet proceeds to update fields that are directly related to + a. With the increased memory capacity of the nodes, the kubelet proceeds to update fields that are directly related to the available memory on the host. 
This would lead to recalculation of oom_score_adj and swap_limits. - b.This is achieved by invoking the CRI API - UpdateContainerResources. + b. This is achieved by invoking the CRI API - UpdateContainerResources. 2. Reinitialise Resource Manager: a. Resource managers such as CPU,Memory are updated with the latest available capacities on the host. This posts the latest @@ -318,6 +319,21 @@ T=1: Resize Instance to Hotplug Memory: Similar flow is applicable for updating oom_score_adj. +#### Compatability with Cluster Autoscaler + +The Cluster Autoscaler (CA) presently anticipates uniform allocatable values among nodes within the same NodeGroup, using existing Nodes as templates for +newly provisioned Nodes from the same NodeGroup. However, with the introduction of NodeResourceHotplug, this assumption may no longer hold true. +If not appropriately addressed, this could cause the Cluster Autoscaler to randomly select a Node from the group and assume identical allocatable values for all upcoming Nodes. +This could lead to suboptimal decisions, such as repeatedly attempting to provision Nodes for pending Pods that are incompatible, or overlooking potential Nodes that could accommodate such Pods. + +To ensure the Cluster Autoscaler acknowledges resource hotplug, the following approaches have been proposed by the Cluster Autoscaler team: +1. Capture Node's Initial Allocatable Values: + * Introduce a new field within the Node object to record initial node allocatable values, which remain unchanged during resource hotplug. + * The Cluster Autoscaler can leverage this field to anticipate potential hotplug of resources, using it as a template for configuring new Nodes. + +2. Identify Nodes Affected by Hotplug: + * By flagging a Node as being impacted by hotplug, the Cluster Autoscaler could revert to a less reliable but more conservative "scale from 0 nodes" logic. 
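The first approach above can be sketched as follows. `InitialAllocatable` is a hypothetical field used only to illustrate the template-selection idea — it is not an agreed API change:

```go
package main

import "fmt"

// Node is a trimmed stand-in for the v1 Node object. InitialAllocatable is
// the hypothetical "initial allocatable values" field from approach 1.
type Node struct {
	Name               string
	Allocatable        map[string]int64 // current values; may reflect hotplug
	InitialAllocatable map[string]int64 // recorded at registration; never resized
}

// templateAllocatable picks what the autoscaler should assume a *new* node in
// the group will provide: the pre-hotplug values, when they are recorded.
func templateAllocatable(n Node) map[string]int64 {
	if n.InitialAllocatable != nil {
		return n.InitialAllocatable
	}
	return n.Allocatable // fallback: today's behaviour
}

func main() {
	n := Node{
		Name:               "worker-0",
		Allocatable:        map[string]int64{"cpu": 8, "memoryGiB": 32}, // after hotplug
		InitialAllocatable: map[string]int64{"cpu": 4, "memoryGiB": 16}, // as provisioned
	}
	fmt.Println(templateAllocatable(n)["cpu"]) // 4, not the hotplugged 8
}
```

This keeps scale-up estimates based on what a freshly provisioned node would actually offer, rather than on a node that has since been resized.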
+ ### Handling HotUnplug Events Though this KEP focuses only on resource hotplug, it will enable the kubelet to capture the current available capacity of the node (regardless of whether it was a hotplug or a hotunplug of resources). From 29835a8806dff90107781bf4c082b349011ff3d9 Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Wed, 21 May 2025 16:39:17 +0530 Subject: [PATCH 18/19] Update OOMScoreAdj formula --- .../3953-node-resource-hot-plug/README.md | 49 +++++++++---------- 1 file changed, 24 insertions(+), 25 deletions(-) diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md index 07decb3d47d..289c2eada03 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/README.md +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -29,7 +29,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Design Details](#design-details) - [Handling hotplug events](#handling-hotplug-events) - [Flow Control for updating swap limit for containers](#flow-control-for-updating-swap-limit-for-containers) - - [Compatability with Cluster Autoscaler](#compatability-with-cluster-autoscaler) + - [Compatibility with Cluster Autoscaler](#compatibility-with-cluster-autoscaler) - [Handling HotUnplug Events](#handling-hotunplug-events) - [Flow Control](#flow-control) - [Test Plan](#test-plan) @@ -138,7 +138,7 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th * Achieve seamless node capacity expansion through resource hotplug. * Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager without reset to accommodate alterations in the node's resource allocation. -* Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods. +* Recalculating and updating swap memory limit for existing pods.
### Non-Goals @@ -187,12 +187,6 @@ detect the change in compute capacity, which can bring in additional complicatio ### Risks and Mitigations -- Change in OOMScoreAdjust value: - - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity` - - With change in memoryCapacity post up-scale, The existing OOMScoreAdjust may not be inline with the - actual OOMScoreAdjust for existing pods. - - This can be mitigated by recalculating the OOMScoreAdjust value for the existing pods. However, there can be an associated overhead for - recalculating the scores. - Change in Swap limit: - The formula to calculate the swap limit is `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable` - With change in nodeTotalMemory and totalPodsSwapAvailable post up-scale, The existing swap limit may not be inline with the @@ -200,6 +194,17 @@ detect the change in compute capacity, which can bring in additional complicatio - This can be mitigated by recalculating the swap limit for the existing pods. However, there can be an associated overhead for recalculating the scores. +- Change in OOMScoreAdjust value: + - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity` + - With the change in memoryCapacity post up-scale, the existing OOMScoreAdjust may not be in line with the + actual OOMScoreAdjust for existing pods. + - It's not recommended to update the OOMScoreAdjust of a running container, as the OOMScoreAdjust value is set for the init process (pid 1), which is + responsible for running all the other processes in the container. + - When we update OOMScoreAdjust for a running container, it is set only for the container's init process (and possibly for processes started later, which inherit it); already + running processes won't get the new OOMScoreAdjust value.
- This can be mitigated by updating the OOMScoreAdj formula to not consider the current memory value; hence the new OOMScoreAdj formula looks like this: + `min(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity)` + - Post up-scale, any failure in the resync of resource managers may lead to incorrect or rejected allocation, which can lead to underperforming or rejected workloads. - To mitigate the risks, adequate tests should be added to avoid the scenarios where failure to resync resource managers can occur. @@ -235,7 +240,7 @@ sequenceDiagram machine-info->>cAdvisor-cache: update cAdvisor-cache->>kubelet: update alt if increase in resource - kubelet->>node: recalculate and update OOMScoreAdj
and Swap limit of containers + kubelet->>node: recalculate and update Swap limit of containers kubelet->>node: re-initialize resource managers kubelet->>node: node status update with new capacity else if decrease in resource @@ -246,7 +251,7 @@ The interaction sequence is as follows: 1. Kubelet will fetch machine resource information from cAdvisor's cache, which is configurable via the cAdvisor flag `update_machine_info_interval`. 2. If the machine resource is increased: - * Recalculate, update OOMScoreAdj and Swap limit of all the running containers. + * Recalculate and update the Swap limit of all the running containers. * Re-initialize resource managers. * Update node with new resource. 3. If the machine resource is decreased: @@ -254,21 +259,16 @@ The interaction sequence is as follows: in case there was no history of hotplug.) With increase in cluster resources the following components will be updated: -1. Change in OOM score adjust: - * Currently, the OOM score adjust is calculated by - `1000 - (1000*containerMemReq)/memoryCapacity` - * Increase in memoryCapacity will result in updated OOM score adjust for pods deployed post resize and also recalculate the same for existing pods. +1. Change in Swap Memory limit: + * Currently, the swap memory limit is calculated by + `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable` + * Increase in nodeTotalMemory or totalPodsSwapAvailable will result in updated swap memory limit for pods deployed post resize and also recalculate the same for existing pods. -2. Change in Swap Memory limit: - * Currently, the swap memory limit is calculated by - `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable` - * Increase in nodeTotalMemory or totalPodsSwapAvailable will result in updated swap memory limit for pods deployed post resize and also recalculate the same for existing pods. +2. Resource managers are re-initialised. -3. Resource managers are re-initialised. +3. Update in Node capacity. -4. Update in Node capacity. -5. Scheduler: +4. Scheduler: * Scheduler will automatically schedule any pending pods.
* This is done as an expected behavior and does not require any changes in the existing design of the scheduler, as the scheduler `watches` the available capacity of the node and creates pods accordingly. @@ -287,6 +287,7 @@ observed in any of the steps, operation is retried from step 1 along with a `Fai a. Resource managers such as CPU,Memory are updated with the latest available capacities on the host. This posts the latest available capacities under the node. b. This is achieved by calling ResyncComponents() of ContainerManager interface to re-sync the resource managers. + 3. Updating the node allocatable resources: a. As the scheduler keeps a tab on the available resources of the node, post updating the available capacities, the scheduler proceeds to schedule any pending pods. @@ -317,9 +318,7 @@ T=1: Resize Instance to Hotplug Memory: - /memory.swap.max: 1G ``` -Similar flow is applicable for updating oom_score_adj. - -#### Compatability with Cluster Autoscaler +#### Compatibility with Cluster Autoscaler The Cluster Autoscaler (CA) presently anticipates uniform allocatable values among nodes within the same NodeGroup, using existing Nodes as templates for newly provisioned Nodes from the same NodeGroup. However, with the introduction of NodeResourceHotplug, this assumption may no longer hold true. 
From 13467acfd96bca05d8d5bc342bd352789314d398 Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Fri, 30 May 2025 14:53:59 +0530 Subject: [PATCH 19/19] Address reveiw comments Co-authored-by: kishen-v --- keps/sig-node/3953-node-resource-hot-plug/README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md index 289c2eada03..c6889688ee7 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/README.md +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -269,9 +269,9 @@ With increase in cluster resources the following components will be updated: 3. Update in Node capacity. 4. Scheduler: - * Scheduler will automatically schedule any pending pods. - * This is done as an expected behavior and does not require any changes in the existing design of the scheduler, as the scheduler `watches` the - available capacity of the node and creates pods accordingly. + * Scheduler keeps trying to schedule any pending pods. + * The scheduler `watches` the updates to the available capacity of the node and schedules pods accordingly. + The scheduler is already doing this today, and this KEP does not require any changes in the scheduler implementation. @@ -333,6 +333,9 @@ To ensure the Cluster Autoscaler acknowledges resource hotplug, the following ap 2. Identify Nodes Affected by Hotplug: * By flagging a Node as being impacted by hotplug, the Cluster Autoscaler could revert to a less reliable but more conservative "scale from 0 nodes" logic. +Given that this KEP and the autoscaler are inter-related, the above approaches were discussed in the community with relevant stakeholders, and it was decided to approach this problem through the former route.
The same will be targeted around the beta graduation of this KEP. + ### Handling HotUnplug Events Though this KEP focuses only on resource hotplug, it will enable the kubelet to capture the current available capacity of the node (regardless of whether it was a hotplug or a hotunplug of resources).