From ba3a63cde0fc62ec66fd2cd9885d9615544e50a9 Mon Sep 17 00:00:00 2001 From: Karthik K N Date: Thu, 13 Apr 2023 09:29:19 +0530 Subject: [PATCH 01/19] KEP for dynamic node resize --- .../3953-dynamic-node-resize/README.md | 732 ++++++++++++++++++ .../3953-dynamic-node-resize/kep.yaml | 33 + 2 files changed, 765 insertions(+) create mode 100644 keps/sig-node/3953-dynamic-node-resize/README.md create mode 100644 keps/sig-node/3953-dynamic-node-resize/kep.yaml diff --git a/keps/sig-node/3953-dynamic-node-resize/README.md b/keps/sig-node/3953-dynamic-node-resize/README.md new file mode 100644 index 00000000000..100d067f1c4 --- /dev/null +++ b/keps/sig-node/3953-dynamic-node-resize/README.md @@ -0,0 +1,732 @@ +# KEP-3953: Node dynamic resize + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation 
History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +This proposal aims at enabling dynamic node resizing. 
This will help in resizing cluster resource capacity by just updating resources of nodes rather than adding new node or removing existing node and +also enable node configurations to be reflected at the node and cluster levels automatically without the need to manually resetting the kubelet + +This proposal also aims to improvise the initialisation and reinitialisation of resource managers like cpu manager, memory manager with the dynamic change in machine's CPU and memory configurations. + +## Motivation +In a typical Kubernetes environment, the cluster resources may need to be altered because of various reasons like +- Incorrect resource assignment while creating a cluster. +- Workload on cluster is increased over time and leading to add more resources to cluster. +- Workload on cluster is decreased over time and leading to resources under utilization. + +To handle these scenarios currently we can +- Horizontally scale up or down cluster by the addition or removal of compute nodes +- Vertically scale up or down cluster by increasing or decreasing the node’s capacity, but the current workaround for the node resize to be captured by the cluster is only by the means of restarting Kubelet. + +The dynamic node resize will give advantages in case of scenarios like +- Handling the resource demand with limited set of machines by increasing the capacity of existing machines rather than creating new ones. +- Creating/Deleting new machine takes more time when compared to increasing/decreasing the capacity of existing ones. + +### Goals + +* Dynamically resize the node without restarting the kubelet +* Add ability to reinitialize resource managers(cpu manager, memory manager) to adopt changes in machine resource + + +### Non-Goals + +* Update the autoscaler to utilize dynamic node resize. + +## Proposal + +This KEP adds a polling mechanism in kubelet to fetch the machine-info using cadvisor, The information will be fetched repeatedly based on configured time interval. 
+Later node status updater will take care of updating this information at node level. + +This KEP also improvises the resource managers like memory manager, cpu manager initialization and reinitialization so that these resource managers will +adapt to the dynamic change in machine configurations. + +### User Stories (Optional) + +#### Story 1 + +As a cluster admin, I want to increase the cluster resource capacity without adding a new node to the cluster. + +#### Story 2 + +As a cluster admin, I want to decrease the cluster resource capacity without removing an existing node from the cluster. + +### Notes/Constraints/Caveats (Optional) + + + + + +### Risks and Mitigations + + + +## Design Details + +Below diagram is shows the interaction between kubelet and cadvisor + +``` ++----------+ +-----------+ +-----------+ +--------------+ +| | | | | | | | +| node | | kubelet | | cadvisor | | machine-info | +| | | | | | | | ++----+-----+ +-----+-----+ +-----+-----+ +-------+------+ + | | | | + | | poll | | + | |------------------------------>| | + | | | | + | | | | + | | | fetch | + | | |------------------------------->| + | | | | + | | | | + | | | | + | | | update | + | | |<-------------------------------| + | | | | + | | update | | + | |<------------------------------| | + | | | | + | | | | + | | | | + | node status update | | | + |<-------------------------------| | | + | | | | + | | | | + | re-run pod admission | | | + |<-------------------------------| | | | + | | | | + | re-initialize resource managers| | | + |<-------------------------------| | | | + | | | | + +``` + +The interaction sequence is as follows +1. Kubelet will be polling cadvisor with interval of configured time like one minute to fetch the machine resource information +2. Cadvisor will fetch and update the machine resource information +3. kubelet cache will be updated with the latest machine resource information +4. node status updater will update the node's status with new resource information +5. 
In case of shrink in cluster resources will re-run the pod admission to evict pods which lack resources +6. kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes + +Note: In case of increase in cluster resources scheduler will automatically schedule any pending pods + +### Test Plan + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name:DynamicNodeResize + - Components depending on the feature gate: kubelet +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? 
+ + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? 
+ +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-node/3953-dynamic-node-resize/kep.yaml b/keps/sig-node/3953-dynamic-node-resize/kep.yaml new file mode 100644 index 00000000000..bf291d4d897 --- /dev/null +++ b/keps/sig-node/3953-dynamic-node-resize/kep.yaml @@ -0,0 +1,33 @@ +title: Dynamic node resize +kep-number: 3953 +authors: + - "@Karthik-K-N" + - "@mkumatag" + - "@kishen-v" +owning-sig: sig-node +participating-sigs: + - sig-node +status: provisional +creation-date: 2023-10-04 +reviewers: + - "@smarterclayton" + - "@ffromani" + - "@SergeyKanzhelev" +approvers: + - "@sig-node-leads" +see-also: + +stage: "alpha" + +latest-milestone: "v1.28" + +milestone: + alpha: "" + beta: "" + stable: "" + +feature-gates: + - name: DynamicNodeResize + components: + - kubelet +disable-supported: true From 1501b40f3fbb08d2ea8f0e88b52bc0dcba536393 Mon Sep 17 00:00:00 2001 From: Kishen V Date: Thu, 13 Apr 2023 17:23:00 +0530 Subject: [PATCH 02/19] Fix KEP doc for node-resize. --- .../3953-dynamic-node-resize/README.md | 190 +++++++++++------- .../3953-dynamic-node-resize/kep.yaml | 6 - 2 files changed, 117 insertions(+), 79 deletions(-) diff --git a/keps/sig-node/3953-dynamic-node-resize/README.md b/keps/sig-node/3953-dynamic-node-resize/README.md index 100d067f1c4..779c2df2e8e 100644 --- a/keps/sig-node/3953-dynamic-node-resize/README.md +++ b/keps/sig-node/3953-dynamic-node-resize/README.md @@ -14,30 +14,30 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Release Signoff Checklist](#release-signoff-checklist) - [Summary](#summary) - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) + - [Goals](#goals) + - [Non-Goals](#non-goals) - [Proposal](#proposal) - - [User Stories (Optional)](#user-stories-optional) - - [Story 1](#story-1) - - [Story 2](#story-2) - - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - - [Risks and Mitigations](#risks-and-mitigations) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - - [Test Plan](#test-plan) - - [Prerequisite testing updates](#prerequisite-testing-updates) - - [Unit tests](#unit-tests) - - [Integration tests](#integration-tests) - - [e2e tests](#e2e-tests) - - [Graduation Criteria](#graduation-criteria) - - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - - [Version Skew Strategy](#version-skew-strategy) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - - [Feature Enablement and Rollback](#feature-enablement-and-rollback) - - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) - - [Monitoring Requirements](#monitoring-requirements) - - [Dependencies](#dependencies) - - [Scalability](#scalability) - - [Troubleshooting](#troubleshooting) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback 
Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) @@ -74,30 +74,29 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -This proposal aims at enabling dynamic node resizing. This will help in resizing cluster resource capacity by just updating resources of nodes rather than adding new node or removing existing node and -also enable node configurations to be reflected at the node and cluster levels automatically without the need to manually resetting the kubelet +The proposal aims at enabling dynamic node resizing. This will help in updating cluster resource capacity by just resizing compute resources of nodes rather than adding new node or removing existing node from a cluster. +The updated node configurations are to be reflected at the node and cluster levels automatically without the need to reset the kubelet. -This proposal also aims to improvise the initialisation and reinitialisation of resource managers like cpu manager, memory manager with the dynamic change in machine's CPU and memory configurations. +This proposal also aims to improve the initialization and reinitialization of resource managers, such as the CPU manager and memory manager, in response to changes in a node's CPU and memory configurations. ## Motivation -In a typical Kubernetes environment, the cluster resources may need to be altered because of various reasons like -- Incorrect resource assignment while creating a cluster. -- Workload on cluster is increased over time and leading to add more resources to cluster. -- Workload on cluster is decreased over time and leading to resources under utilization. 
+In a typical Kubernetes environment, the cluster resources may need to be altered due to following reasons: +- Incorrect resource assignment during cluster creation. +- Increased workload over time, leading to the need for additional resources in the cluster. +- Decreased workload over time, leading to resource underutilization in the cluster. -To handle these scenarios currently we can -- Horizontally scale up or down cluster by the addition or removal of compute nodes -- Vertically scale up or down cluster by increasing or decreasing the node’s capacity, but the current workaround for the node resize to be captured by the cluster is only by the means of restarting Kubelet. +To handle these scenarios, we can: +- Horizontally scale up or down the cluster by adding or removing compute nodes. +- Vertically scale up or down the cluster by increasing or decreasing node capacity. However, currently, the workaround for capturing node resizing in the cluster involves restarting the Kubelet. -The dynamic node resize will give advantages in case of scenarios like -- Handling the resource demand with limited set of machines by increasing the capacity of existing machines rather than creating new ones. -- Creating/Deleting new machine takes more time when compared to increasing/decreasing the capacity of existing ones. +Dynamic node resizing will provide advantages in scenarios such as: +- Handling resource demand with a limited set of nodes by increasing the capacity of existing nodes instead of creating new nodes. +- Creating or deleting new nodes takes more time compared to increasing or decreasing the capacity of existing nodes. ### Goals -* Dynamically resize the node without restarting the kubelet -* Add ability to reinitialize resource managers(cpu manager, memory manager) to adopt changes in machine resource - +* Dynamically resize the node without restarting the kubelet. 
+* Ability to reinitialize resource managers (CPU manager, memory manager) to adapt to changes in the node's resources.

### Non-Goals

@@ -105,21 +104,19 @@ The dynamic node resize will give advantages in case of scenarios like

## Proposal

-This KEP adds a polling mechanism in kubelet to fetch the machine-info using cadvisor, The information will be fetched repeatedly based on configured time interval.
-Later node status updater will take care of updating this information at node level.
+This KEP adds a polling mechanism in kubelet to fetch the machine-information from cAdvisor's cache. The information will be fetched periodically based on a configured time interval, after which the node status updater is responsible for updating this information at the node level in the cluster.

-This KEP also improvises the resource managers like memory manager, cpu manager initialization and reinitialization so that these resource managers will
-adapt to the dynamic change in machine configurations.
+Additionally, this KEP aims to improve the initialization and reinitialization of resource managers, such as the memory manager and CPU manager, so that they can adapt to changes in the node's configuration.

### User Stories (Optional)

#### Story 1

-As a cluster admin, I want to increase the cluster resource capacity without adding a new node to the cluster.
+As a cluster admin, I must be able to increase the cluster resource capacity without adding a new node to the cluster.

#### Story 2

-As a cluster admin, I want to decrease the cluster resource capacity without removing an existing node from the cluster.
+As a cluster admin, I must be able to decrease the cluster resource capacity without removing an existing node from the cluster.

### Notes/Constraints/Caveats (Optional)

@@ -148,13 +145,13 @@ Consider including folks who also work outside the SIG or subproject.
## Design Details -Below diagram is shows the interaction between kubelet and cadvisor +Below diagram is shows the interaction between kubelet and cAdvisor. ``` +----------+ +-----------+ +-----------+ +--------------+ | | | | | | | | -| node | | kubelet | | cadvisor | | machine-info | -| | | | | | | | +| node | | kubelet | | cAdvisor | | machine-info | +| | | | | cache | | | +----+-----+ +-----+-----+ +-----+-----+ +-------+------+ | | | | | | poll | | @@ -177,7 +174,7 @@ Below diagram is shows the interaction between kubelet and cadvisor | node status update | | | |<-------------------------------| | | | | | | - | | | | + | if shrink in resource | | | | re-run pod admission | | | |<-------------------------------| | | | | | | | @@ -188,14 +185,76 @@ Below diagram is shows the interaction between kubelet and cadvisor ``` The interaction sequence is as follows -1. Kubelet will be polling cadvisor with interval of configured time like one minute to fetch the machine resource information -2. Cadvisor will fetch and update the machine resource information -3. kubelet cache will be updated with the latest machine resource information -4. node status updater will update the node's status with new resource information -5. In case of shrink in cluster resources will re-run the pod admission to evict pods which lack resources -6. kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes +1. Kubelet will be polling in interval of configured time to fetch the machine resource information from cAdvisor's cache, Which is currently updated every 5 minutes. +3. Kubelet's cache will be updated with the latest machine resource information. +4. Node status updater will update the node's status with the latest resource information. +5. In case of a shrink in cluster resources rerun the pod admission and the pod admission will evict pods +6. 
Kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes.
+
+Note: In case of increase in cluster resources, the scheduler will automatically schedule any pending pods.
+
+**Kubelet Configuration changes**
+
+* A new boolean variable `dynamicNodeResize` will be added to kubelet configuration.
+* `dynamicNodeResize` will be false by default.
+* User need to set `dynamicNodeResize` to true make use of Dynamic Node Resize.
+
+**Proposed Code changes**
+
+**Dynamic Node resize and Pod Re-admission logic**
+
+```azure
+    if kl.kubeletConfiguration.DynamicNodeResize {
+        // Handle the node dynamic resize
+        machineInfo, err := kl.cadvisor.MachineInfo()
+        if err != nil {
+            klog.ErrorS(err, "Error fetching machine info")
+        } else {
+            cachedMachineInfo, _ := kl.GetCachedMachineInfo()
+
+            if !reflect.DeepEqual(cachedMachineInfo, machineInfo) {
+                kl.setCachedMachineInfo(machineInfo)
+
+                // Resync the resource managers
+                if err := kl.ResyncComponents(machineInfo); err != nil {
+                    klog.ErrorS(err, "Error resyncing the kubelet components with machine info")
+                }
+
+                //Rerun pod admission only in case of shrink in cluster resources
+                if machineInfo.NumCores < cachedMachineInfo.NumCores || machineInfo.MemoryCapacity < cachedMachineInfo.MemoryCapacity {
+                    klog.InfoS("Observed shrink in nod resources, rerunning pod admission")
+                    kl.HandlePodAdditions(activePods)
+                }
+            }
+        }
+    }
+```
+
+**Changes to resource managers to adapt to dynamic resize**
+
+1. Adding ResyncComponents() method to ContainerManager interface
+```azure
+    // Manages the containers running on a machine.
+    type ContainerManager interface {
+    .
+    .
+    // ResyncComponents will resync the resource managers like cpu, memory and topology managers
+    // with updated machineInfo
+    ResyncComponents(machineInfo *cadvisorapi.MachineInfo) error
+    .
+    .
+    )
+```
+
+2. Adding a Sync method to all the resource managers, to be invoked whenever there is a dynamic resource change.
+ +```azure + // Sync will sync the CPU Manager with the latest machine info + Sync(machineInfo *cadvisorapi.MachineInfo) error +``` + -Note: In case of increase in cluster resources scheduler will automatically schedule any pending pods +Note: PoC code changes: https://github.com/kubernetes/kubernetes/pull/115755 ### Test Plan @@ -212,26 +271,11 @@ implementing this enhancement to ensure the enhancements have also solid foundat ##### Unit tests - - - +1. Add necessary tests in kubelet_node_status_test.go to check for the node status behaviour with dynamic node resize. +2. Add necessary tests in kubelet_pods_test.go to check for the pod cleanup and pod addition workflow. +3. Add necessary tests in eventhandlers_test.go to check for scheduler behaviour with dynamic node capacity change. +4. Add necessary tests in resource managers to check for managers behaviour to adopt dynamic node capacity change. -- ``: `` - `` ##### Integration tests diff --git a/keps/sig-node/3953-dynamic-node-resize/kep.yaml b/keps/sig-node/3953-dynamic-node-resize/kep.yaml index bf291d4d897..edd968e005e 100644 --- a/keps/sig-node/3953-dynamic-node-resize/kep.yaml +++ b/keps/sig-node/3953-dynamic-node-resize/kep.yaml @@ -25,9 +25,3 @@ milestone: alpha: "" beta: "" stable: "" - -feature-gates: - - name: DynamicNodeResize - components: - - kubelet -disable-supported: true From 16af5dc2eea1716a47417ac1edbd5ba1b3c876aa Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Tue, 10 Sep 2024 16:28:12 +0530 Subject: [PATCH 03/19] Update to emphasis on scale up of resoures --- .../3953-dynamic-node-resize/README.md | 91 ++++++------------- .../3953-dynamic-node-resize/kep.yaml | 2 +- 2 files changed, 31 insertions(+), 62 deletions(-) diff --git a/keps/sig-node/3953-dynamic-node-resize/README.md b/keps/sig-node/3953-dynamic-node-resize/README.md index 779c2df2e8e..678d69821dc 100644 --- a/keps/sig-node/3953-dynamic-node-resize/README.md +++ b/keps/sig-node/3953-dynamic-node-resize/README.md @@ -74,7 +74,7 
@@ Items marked with (R) are required *prior to targeting to a milestone / release*

## Summary

-The proposal aims at enabling dynamic node resizing. This will help in updating cluster resource capacity by just resizing compute resources of nodes rather than adding new node or removing existing node from a cluster.
+The proposal aims at enabling dynamic node resizing. This will help in updating cluster resource capacity by just resizing compute resources of nodes rather than adding a new node to a cluster.
The updated node configurations are to be reflected at the node and cluster levels automatically without the need to reset the kubelet.

This proposal also aims to improve the initialization and reinitialization of resource managers, such as the CPU manager and memory manager, in response to changes in a node's CPU and memory configurations.
@@ -83,15 +83,14 @@ This proposal also aims to improve the initialization and reinitialization of re
In a typical Kubernetes environment, the cluster resources may need to be altered due to following reasons:
- Incorrect resource assignment during cluster creation.
- Increased workload over time, leading to the need for additional resources in the cluster.
-- Decreased workload over time, leading to resource underutilization in the cluster.

To handle these scenarios, we can:
-- Horizontally scale up or down the cluster by adding or removing compute nodes.
-- Vertically scale up or down the cluster by increasing or decreasing node capacity. However, currently, the workaround for capturing node resizing in the cluster involves restarting the Kubelet.
+- Horizontally scale up the cluster by adding compute nodes.
+- Vertically scale up the cluster by increasing node capacity. However, currently, the workaround for capturing node resizing in the cluster involves restarting the Kubelet.
Dynamic node resizing will provide advantages in scenarios such as:
- Handling resource demand with a limited set of nodes by increasing the capacity of existing nodes instead of creating new nodes.
-- Creating or deleting new nodes takes more time compared to increasing or decreasing the capacity of existing nodes.
+- Creating new nodes takes more time compared to increasing the capacity of existing nodes.

### Goals

@@ -101,9 +100,13 @@ Dynamic node resizing will provide advantages in scenarios such as:

### Non-Goals

* Update the autoscaler to utilize dynamic node resize.
+* Dynamically adjust system reserved and kube reserved values.

## Proposal

+This KEP aims to support dynamic resizing of a node's compute resources, focusing on dynamic scale-up of resources.
+Dynamic scale-down of resources will be proposed in a separate KEP in the future.
+
This KEP adds a polling mechanism in kubelet to fetch the machine-information from cAdvisor's cache. The information will be fetched periodically based on a configured time interval, after which the node status updater is responsible for updating this information at the node level in the cluster.

Additionally, this KEP aims to improve the initialization and reinitialization of resource managers, such as the memory manager and CPU manager, so that they can adapt to changes in the node's configuration.

### User Stories (Optional)

#### Story 1

As a cluster admin, I must be able to increase the cluster resource capacity without adding a new node to the cluster.

-#### Story 2
-
-As a cluster admin, I must be able to decrease the cluster resource capacity without removing an existing node from the cluster.
-
### Notes/Constraints/Caveats (Optional)

@@ -145,66 +144,42 @@ Consider including folks who also work outside the SIG or subproject.

## Design Details

+
The diagram below shows the interaction between kubelet and cAdvisor.
-``` -+----------+ +-----------+ +-----------+ +--------------+ -| | | | | | | | -| node | | kubelet | | cAdvisor | | machine-info | -| | | | | cache | | | -+----+-----+ +-----+-----+ +-----+-----+ +-------+------+ - | | | | - | | poll | | - | |------------------------------>| | - | | | | - | | | | - | | | fetch | - | | |------------------------------->| - | | | | - | | | | - | | | | - | | | update | - | | |<-------------------------------| - | | | | - | | update | | - | |<------------------------------| | - | | | | - | | | | - | | | | - | node status update | | | - |<-------------------------------| | | - | | | | - | if shrink in resource | | | - | re-run pod admission | | | - |<-------------------------------| | | | - | | | | - | re-initialize resource managers| | | - |<-------------------------------| | | | - | | | | +```mermaid +sequenceDiagram + participant node + participant kubelet + participant cAdvisor-cache + participant machine-info + kubelet->>cAdvisor-cache: Poll + cAdvisor-cache->>machine-info: fetch + machine-info->>cAdvisor-cache: update + cAdvisor-cache->>kubelet: update + kubelet->>node: node status update + kubelet->>node: re-initialize resource managers ``` The interaction sequence is as follows -1. Kubelet will be polling in interval of configured time to fetch the machine resource information from cAdvisor's cache, Which is currently updated every 5 minutes. -3. Kubelet's cache will be updated with the latest machine resource information. -4. Node status updater will update the node's status with the latest resource information. -5. In case of a shrink in cluster resources rerun the pod admission and the pod admission will evict pods -6. Kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes. +1. Kubelet will be polling in interval to fetch the machine resource information from cAdvisor's cache, Which is currently updated every 5 minutes. +2. 
Kubelet's cache will be updated with the latest machine resource information. +3. Node status updater will update the node's status with the latest resource information. +4. Kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes. Note: In case of increase in cluster resources, the scheduler will automatically schedule any pending pods. **Kubelet Configuration changes** -* A new boolean variable `dynamicNodeResize` will be added to kubelet configuration. -* `dynamicNodeResize` will be false by default. -* User need to set `dynamicNodeResize` to true make use of Dynamic Node Resize. +* Add a variable to configure the interval to fetch the updated machine information. **Proposed Code changes** **Dynamic Node resize and Pod Re-admission logic** -```azure - if kl.kubeletConfiguration.DynamicNodeResize { +```go + if utilfeature.DefaultFeatureGate.Enabled(features.DynamicNodeResize) { // Handle the node dynamic resize machineInfo, err := kl.cadvisor.MachineInfo() if err != nil { @@ -219,12 +194,6 @@ Note: In case of increase in cluster resources, the scheduler will automatically if err := kl.ResyncComponents(machineInfo); err != nil { klog.ErrorS(err, "Error resyncing the kubelet components with machine info") } - - //Rerun pod admission only in case of shrink in cluster resources - if machineInfo.NumCores < cachedMachineInfo.NumCores || machineInfo.MemoryCapacity < cachedMachineInfo.MemoryCapacity { - klog.InfoS("Observed shrink in nod resources, rerunning pod admission") - kl.HandlePodAdditions(activePods) - } } } } @@ -233,7 +202,7 @@ Note: In case of increase in cluster resources, the scheduler will automatically **Changes to resource managers to adapt to dynamic resize** 1. Adding ResyncComponents() method to ContainerManager interface -```azure +```go // Manages the containers running on a machine. type ContainerManager interface { . 
@@ -248,7 +217,7 @@ Note: In case of increase in cluster resources, the scheduler will automatically
2. Adding a method Sync to all the resource managers and will be invoked once there is dynamic resource change.
-```azure
+```go
// Sync will sync the CPU Manager with the latest machine info
Sync(machineInfo *cadvisorapi.MachineInfo) error
```
diff --git a/keps/sig-node/3953-dynamic-node-resize/kep.yaml b/keps/sig-node/3953-dynamic-node-resize/kep.yaml
index edd968e005e..ea02aca8620 100644
--- a/keps/sig-node/3953-dynamic-node-resize/kep.yaml
+++ b/keps/sig-node/3953-dynamic-node-resize/kep.yaml
@@ -19,7 +19,7 @@ see-also:
stage: "alpha"
-latest-milestone: "v1.28"
+latest-milestone: "v1.32"
milestone:
alpha: ""
From 6f55c96b6295c3c1ce88345ecf2af23c79155413 Mon Sep 17 00:00:00 2001
From: Karthik Bhat
Date: Mon, 13 Jan 2025 19:46:43 +0530
Subject: [PATCH 04/19] Rename the KEP to match the updated scope
--- .../README.md | 286 ++++++------------ .../kep.yaml | 5 +- 2 files changed, 103 insertions(+), 188 deletions(-) rename keps/sig-node/{3953-dynamic-node-resize => 3953-node-resource-hot-plug}/README.md (74%) rename keps/sig-node/{3953-dynamic-node-resize => 3953-node-resource-hot-plug}/kep.yaml (82%)
diff --git a/keps/sig-node/3953-dynamic-node-resize/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md
similarity index 74%
rename from keps/sig-node/3953-dynamic-node-resize/README.md
rename to keps/sig-node/3953-node-resource-hot-plug/README.md
index 678d69821dc..44e2fc07ee2 100644
--- a/keps/sig-node/3953-dynamic-node-resize/README.md
+++ b/keps/sig-node/3953-node-resource-hot-plug/README.md
@@ -1,4 +1,4 @@
-# KEP-3953: Node dynamic resize
+# KEP-3953: Node Resource Hot Plug
+### Notes/Constraints/Caveats (Optional)
### Risks and Mitigations
-
+1. Node resource hot plugging is an opt-in feature, merging the + feature related changes won't impact existing workloads.
Moreover, the feature + will be rolled out gradually, beginning with an alpha release for testing and + gathering feedback. This will be followed by beta and GA releases as the + feature matures and potential problems and improvements are addressed. +2. Though the node resource is updated dynamically, the dynamic data is fetched from cAdvisor and its well integrated with kubelet. +3. Resource manager are updated to adapt to the dynamic node reconfigurations, Enough tests should be added to make sure its not affecting the existing functionalities. ## Design Details @@ -170,17 +157,13 @@ The interaction sequence is as follows Note: In case of increase in cluster resources, the scheduler will automatically schedule any pending pods. -**Kubelet Configuration changes** - -* Add a variable to configure the interval to fetch the updated machine information. - **Proposed Code changes** -**Dynamic Node resize and Pod Re-admission logic** +**Dynamic Node Scale Up logic** ```go - if utilfeature.DefaultFeatureGate.Enabled(features.DynamicNodeResize) { - // Handle the node dynamic resize + if utilfeature.DefaultFeatureGate.Enabled(features.NodeResourceHotPlug) { + // Handle the node dynamic scale up machineInfo, err := kl.cadvisor.MachineInfo() if err != nil { klog.ErrorS(err, "Error fetching machine info") @@ -199,7 +182,7 @@ Note: In case of increase in cluster resources, the scheduler will automatically } ``` -**Changes to resource managers to adapt to dynamic resize** +**Changes to resource managers to adapt to dynamic scale up of resources** 1. 
Adding ResyncComponents() method to ContainerManager interface
```go
@@ -222,119 +205,42 @@ Note: In case of increase in cluster resources, the scheduler will automatically
Sync(machineInfo *cadvisorapi.MachineInfo) error
```
- -Note: PoC code changes: https://github.com/kubernetes/kubernetes/pull/115755 -
### Test Plan
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
-##### Prerequisite testing updates - -
##### Unit tests
-1. Add necessary tests in kubelet_node_status_test.go to check for the node status behaviour with dynamic node resize.
+1. Add necessary tests in kubelet_node_status_test.go to check for the node status behaviour with dynamic node scale up.
2. Add necessary tests in kubelet_pods_test.go to check for the pod cleanup and pod addition workflow. 3. Add necessary tests in eventhandlers_test.go to check for scheduler behaviour with dynamic node capacity change. 4. Add necessary tests in resource managers to check for managers behaviour to adopt dynamic node capacity change.
-##### Integration tests - - -- :
##### e2e tests
-
Following scenarios need to be covered:
-- :
+* Node resource information before and after resource hot plug.
+* State of Pending pods due to lack of resources after resource hot plug.
+* Resource manager states after the resynch of components.
### Graduation Criteria
-
### Upgrade / Downgrade Strategy
@@ -408,7 +314,7 @@ well as the [existing list] of feature gates. -->
- [x] Feature gate (also fill in values in `kep.yaml`)
- - Feature gate name:DynamicNodeResize
+ - Feature gate name: NodeResourceHotPlug
- Components depending on the feature gate: kubelet
- [ ] Other - Describe the mechanism:
@@ -419,40 +325,26 @@ well as the [existing list] of feature gates.
###### Does enabling the feature change any default behavior?
-
+No. This feature is guarded by a feature gate. 
Existing default behavior does not change if the
+feature is not used.
+Even if the feature is enabled via the feature gate, if there is no change in
+node configuration the system will continue to work in the same way.
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
-
+Yes. Once disabled, any hot plug of resources won't be reflected at the cluster level without a kubelet restart.
###### What happens if we reenable the feature if it was previously rolled back?
-###### Are there any tests for feature enablement/disablement?
+If the feature is reenabled, node resources can be hot plugged again. The cluster will be automatically updated
+with the new resource information.
-
+Yes, the tests will be added along with the alpha implementation.
+* Validate that a hot plug of resources to the machine is reflected at the node resource level.
+* Validate that a hot plug of resources makes pending pods transition into the Running state.
+* Validate that the resource managers are updated with the latest machine information after a hot plug of resources.
### Rollout, Upgrade and Rollback Planning
@@ -472,6 +364,11 @@ rollout. Similarly, consider large clusters and how enablement/disablement will rollout across nodes. -->
+Rollout may fail if the resource managers are not re-synced properly due to programmatic errors.
+In case of rollout failures, running workloads are not affected; if pods are in the Pending state they remain
+in the Pending state.
+Rollback failure should not affect running workloads.
+
###### What specific metrics should inform a rollback?
+If pods remain pending after a hot plug of resources and there is still no change in the `scheduler_pending_pods` metric,
+the feature is not working as expected.
+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+It will be tested manually as part of the implementation, and there will also be automated tests to cover the scenarios. 
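For reference, enabling the gate on a node comes down to a KubeletConfiguration fragment like the following sketch; the gate name `NodeResourceHotPlug` is the alpha name proposed in this KEP and the fragment is illustrative, not prescriptive:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Opt in to node resource hot plug (alpha; default false).
  NodeResourceHotPlug: true
```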
+ ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
-
+No
### Monitoring Requirements
+This feature will be built into kubelet and behind a feature gate. Examining the kubelet feature gate would help
+in determining whether the feature is used. The enablement of the kubelet feature gate can be determined from the
+`kubernetes_feature_enabled` metric.
+
###### How can someone using this feature know that it is working for their instance?
-- [ ] Events - - Event Reason: -- [ ] API .status - - Condition name: - - Other field: -- [ ] Other (treat as last resort) - - Details:
+An end user can hot plug resources and verify that the change is reflected at the node resource level.
+In case there were any pending pods prior to the resource hot plug, those pods should transition into Running with the addition
+of new resources.
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
@@ -545,19 +447,16 @@ high level (needs more precise definitions) those may be things like: These goals will help you determine what you need to measure (SLIs) in the next question. -->
-
+No increase in the `scheduler_pending_pods` rate.
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
-- [ ] Metrics - - Metric name: - - [Optional] Aggregation method: - - Components exposing the metric: -- [ ] Other (treat as last resort) - - Details:
+- [X] Metrics
+ - Metric name: `scheduler_pending_pods`
+ - Components exposing the metric: scheduler
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -565,7 +464,7 @@ Pick one more of these and delete the rest. Describe the metrics themselves and the reasons why they weren't added (e.g., cost, implementation difficulties, etc.). 
--> - +No
### Dependencies
+No. It does not depend on any service running on the cluster, but it depends on the cAdvisor package to fetch
+the machine resource information.
### Scalability
@@ -616,6 +517,10 @@ Focusing mostly on: heartbeats, leader election, etc.) -->
+No. It won't add/modify any user-facing APIs.
+The resource managers might need to be updated with new methods to resync their components with updated
+machine information.
+
###### Will enabling / using this feature result in introducing new API types?
-
+No
###### Will enabling / using this feature result in any new calls to the cloud provider?
-
+No
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
-
+No
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
-
+Negligible. In the case of a resource hot plug, the resource managers may take some time to resync.
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
-
+Negligible computational overhead might be introduced into kubelet as it periodically needs to fetch machine information
+from the cAdvisor cache and resync all the resource managers with the updated machine information.
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+Yes, it could.
+Since the node's computational capacity is increased dynamically, there might be more pods scheduled on the node.
+This is, however, mitigated by the maxPods kubelet configuration that limits the number of pods on a node.
### Troubleshooting
@@ -692,6 +601,10 @@ details). For now, we leave it here.
###### How does this feature react if the API server and/or etcd is unavailable?
+This feature is node local and mainly handled in kubelet; it has no dependency on etcd. 
+In case there are pending pods and resources are hot plugged, the scheduler relies on the API server to fetch node information.
+Without access to the API server, it cannot make scheduling decisions as the node resources are not updated. The pending pods would remain in the same condition.
+
###### What are other known failure modes?
+This feature mainly does two things: fetch machine information from cAdvisor and reinitialize resource managers.
+Failure scenarios can occur at the cAdvisor level, i.e., if it is wrongly updated with incorrect machine information.
+
+
###### What steps should be taken if SLOs are not being met to determine the problem?
+If enabling this feature causes performance degradation, it is suggested not to hot plug resources and to restart the kubelet
+manually to continue operation as before.
+
## Implementation History
@@ -730,16 +650,10 @@ Why should this KEP _not_ be implemented?
## Alternatives
+The existing alternative to this effort would be restarting the kubelet manually each time after a node resize. 
+ - -## Infrastructure Needed (Optional) - -
diff --git a/keps/sig-node/3953-dynamic-node-resize/kep.yaml b/keps/sig-node/3953-node-resource-hot-plug/kep.yaml
similarity index 82%
rename from keps/sig-node/3953-dynamic-node-resize/kep.yaml
rename to keps/sig-node/3953-node-resource-hot-plug/kep.yaml
index ea02aca8620..232e4542e04 100644
--- a/keps/sig-node/3953-dynamic-node-resize/kep.yaml
+++ b/keps/sig-node/3953-node-resource-hot-plug/kep.yaml
@@ -1,4 +1,4 @@
-title: Dynamic node resize
+title: Node Resource Hot Plug
kep-number: 3953
authors:
- "@Karthik-K-N"
@@ -13,13 +13,14 @@ reviewers:
- "@smarterclayton"
- "@ffromani"
- "@SergeyKanzhelev"
+ - "@haircommander"
approvers:
- "@sig-node-leads"
see-also:
stage: "alpha"
-latest-milestone: "v1.32"
+latest-milestone: "v1.33"
milestone:
alpha: ""
From 0af00c8ceea02f4441205b0c6f6fb988497c4b38 Mon Sep 17 00:00:00 2001
From: Karthik Bhat
Date: Thu, 30 Jan 2025 20:50:27 +0530
Subject: [PATCH 05/19] Address review comments
--- keps/prod-readiness/sig-node/3953.yaml | 3 +++ .../3953-node-resource-hot-plug/README.md | 19 +++++++++++++++++-- .../3953-node-resource-hot-plug/kep.yaml | 18 ++++++++++++++++++ 3 files changed, 38 insertions(+), 2 deletions(-) create mode 100644 keps/prod-readiness/sig-node/3953.yaml
diff --git a/keps/prod-readiness/sig-node/3953.yaml b/keps/prod-readiness/sig-node/3953.yaml
new file mode 100644
index 00000000000..cc389c5e866
--- /dev/null
+++ b/keps/prod-readiness/sig-node/3953.yaml
@@ -0,0 +1,3 @@
+kep-number: 3953
+alpha:
+ approver: "@deads2k"
\ No newline at end of file
diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md
index 44e2fc07ee2..dfb06debe9d 100644
--- a/keps/sig-node/3953-node-resource-hot-plug/README.md
+++ b/keps/sig-node/3953-node-resource-hot-plug/README.md
@@ -17,8 +17,9 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Goals](#goals) - [Non-Goals](#non-goals) - [Proposal](#proposal)
- - [User Stories (Optional)](#user-stories-optional)
+ - [User Stories](#user-stories)
- [Story 1](#story-1)
+ - [Story 2](#story-2)
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details)
@@ -26,7 +27,10 @@ tags, and then generate with `hack/update-toc.sh`.
- [Unit tests](#unit-tests) - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria)
+ - [Phase 1: Alpha (target 1.33)](#phase-1-alpha-target-133)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+ - [Upgrade](#upgrade)
+ - [Downgrade](#downgrade)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
@@ -38,7 +42,6 @@ tags, and then generate with `hack/update-toc.sh`.
- [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives)
-- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
## Release Signoff Checklist
@@ -256,6 +259,16 @@ enhancement: cluster required to make on upgrade, in order to make use of the enhancement? -->
+##### Upgrade
+
+To upgrade the cluster to use this feature, the kubelet should be updated with the feature gate enabled.
+Existing clusters are not impacted, as the node resources have already been updated during cluster creation.
+
+##### Downgrade
+
+It's always possible to trivially downgrade to the previous kubelet. It does not have any impact, as future node resource hot plugs won't be reflected in the cluster
+without a manual kubelet restart.
+
### Version Skew Strategy
+Not relevant, as this is a kubelet-specific feature and does not impact other components. 
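The upgrade/downgrade behaviour described above hinges on the poll → compare → resync loop sketched in the Design Details. The following self-contained Go sketch illustrates that loop under simplifying assumptions: the local `machineInfo`, `resyncer`, and `pollOnce` names stand in for cAdvisor's `MachineInfo` and the kubelet's internals, and are not the real API.

```go
package main

import (
	"fmt"
	"reflect"
)

// machineInfo is a stand-in for cAdvisor's MachineInfo in this sketch.
type machineInfo struct {
	NumCores       int
	MemoryCapacity uint64
}

// resyncer mimics the proposed ResyncComponents hook on the container manager;
// it records every machine info it was asked to resync with.
type resyncer struct{ synced []machineInfo }

func (r *resyncer) ResyncComponents(mi machineInfo) error {
	r.synced = append(r.synced, mi)
	return nil
}

// pollOnce compares freshly fetched info with the cached copy and triggers a
// resync only when something changed, as the KEP describes. It reports whether
// a change was observed.
func pollOnce(cached *machineInfo, fetch func() machineInfo, r *resyncer) bool {
	fresh := fetch()
	if reflect.DeepEqual(*cached, fresh) {
		return false // nothing changed; skip node status update and resync
	}
	*cached = fresh
	r.ResyncComponents(fresh) // error handling elided in this sketch
	return true
}

func main() {
	cached := machineInfo{NumCores: 4, MemoryCapacity: 8 << 30}
	r := &resyncer{}
	// First poll: the simulated fetch returns identical info, so no resync.
	fmt.Println(pollOnce(&cached, func() machineInfo { return machineInfo{NumCores: 4, MemoryCapacity: 8 << 30} }, r))
	// Second poll: four CPUs were hot plugged (4 -> 8), so the cache updates.
	fmt.Println(pollOnce(&cached, func() machineInfo { return machineInfo{NumCores: 8, MemoryCapacity: 8 << 30} }, r))
	fmt.Println(cached.NumCores)
}
```

Because an unchanged poll is a no-op, disabling the feature (or downgrading the kubelet) simply stops the loop; nothing needs to be undone.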
+ ## Production Readiness Review Questionnaire ## Release Signoff Checklist @@ -74,28 +76,31 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -The proposal aims at enabling hot plugging of node compute resources. This will help in updating cluster resource capacity by just resizing compute resources of nodes rather than adding new node to a cluster. -The updated node configurations are to be reflected at the node and cluster levels automatically without the need to reset the kubelet. - -This proposal also aims to improve the initialization and reinitialization of resource managers, such as the CPU manager and memory manager, in response to changes in a node's CPU and memory configurations. - +The proposal seeks to facilitate hot plugging of node compute resources, thereby streamlining cluster resource capacity updates through node compute resource resizing, rather than introducing new nodes to the cluster. +The revised node configurations will be automatically propagated at both the node and cluster levels, eliminating the necessity for a kubelet reset. +Furthermore, this proposal intends to enhance the initialization and reinitialization processes of resource managers, including the CPU manager and memory manager, in response to alterations in a node's CPU and memory configurations. +This approach aims to optimize resource management, improve scalability, and minimize disruptions to cluster operations. ## Motivation -In a typical Kubernetes environment, the cluster resources may need to be altered due to following reasons: -- Incorrect resource assignment during cluster creation. -- Increased workload over time, leading to the need for additional resources in the cluster. -To handle these scenarios, we can: -- Horizontally scale up the cluster by adding compute nodes. -- Vertically scale up the cluster by increasing node capacity. 
However, currently, the workaround for capturing node resizing in the cluster involves restarting the Kubelet. +In a conventional Kubernetes environment, the cluster resources might necessitate modification due to the following factors: +- Inaccurate resource allocation during cluster initialization. +- Escalating workload over time, necessitating supplementary resources within the cluster. -Node resource hot plugging will provide advantages in scenarios such as: -- Handling resource demand with a limited set of nodes by increasing the capacity of existing nodes instead of creating new nodes. -- Creating new nodes takes more time compared to increasing the capacity of existing nodes. +To address these situations, we can: +- Horizontally scale the cluster by incorporating additional compute nodes. +- Vertically scale the cluster by augmenting node capacity. Currently, the method to capture node resizing within the cluster entails restarting the Kubelet. +These strategies enable the cluster to adapt to varying resource demands, ensuring optimal performance and efficient resource utilization. However, the limitation of requiring a Kubelet restart for node resizing is an area for potential improvement. + +Node resource hot plugging offers benefits in situations like: +- Managing resource demand with a restricted number of nodes by enhancing the capacity of current nodes rather than creating new ones. +- The process of creating new nodes is more time-consuming compared to augmenting the capacity of existing nodes. + +This approach allows for more efficient resource management and quicker capacity adjustments, optimizing the utilization of existing hardware. ### Goals -* Dynamically scale up the node by hot plugging resources and without restarting the kubelet. -* Ability to reinitialize resource managers (CPU manager, memory manager) to adopt changes in node's resource. 
+* Achieve seamless node capacity expansion through hot plugging resources, all without necessitating a kubelet restart. +* Facilitate the reinitialization of resource managers (CPU manager, memory manager) to accommodate alterations in the node's resource allocation. ### Non-Goals @@ -103,12 +108,12 @@ Node resource hot plugging will provide advantages in scenarios such as: * Hot unplug of node resources. * Update the autoscaler to utilize resource hot plugging. - ## Proposal -This KEP aims to support the node resource hot plugging by adding a polling mechanism in kubelet to fetch the machine-information from cAdvisor's cache which is already updated periodically, This information will be fetched periodically by kubelet, after which the node status updater is responsible for updating this information at node level in the cluster. -Additionally, this KEP aims to improve the initialization and reinitialization of resource managers, such as the memory manager and CPU manager, so that they can adapt to change in node's configurations. +This KEP strives to enable node resource hot plugging by incorporating a polling mechanism within the kubelet to retrieve machine-information from cAdvisor's cache, which is already updated periodically. +The kubelet will periodically fetch this information, subsequently entrusting the node status updater to disseminate these updates at the node level across the cluster. +Moreover, this KEP aims to refine the initialization and reinitialization processes of resource managers, including the memory manager and CPU manager, to ensure their adaptability to changes in node configurations. ### User Stories @@ -118,7 +123,7 @@ As a cluster admin, I must be able to increase the cluster resource capacity wit #### Story 2 -As a cluster admin, I must be able to increase the cluster resource capacity without need to restarting the kubelet. 
+As a cluster admin, I must be able to increase the cluster resource capacity without need to restart the kubelet.
### Notes/Constraints/Caveats (Optional)
@@ -158,7 +163,20 @@ The interaction sequence is as follows
3. Node status updater will update the node's status with the latest resource information. 4. Kubelet will reinitialize the resource managers to keep them up to date with dynamic resource changes.
-Note: In case of increase in cluster resources, the scheduler will automatically schedule any pending pods.
+With an increase in cluster resources, the following components will be updated:
+
+1. Scheduler
+ * Scheduler will automatically schedule any pending pods.
+
+
+2. Change in Swap Memory limit
+ * Currently, the swap memory limit is calculated as
+ `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable`
+ * So an increase in nodeTotalMemory will result in an updated swap memory limit.
+
+
+3. Change in OOM score
+ * OOM score calculation depends on the machine's memory, so the new OOM score will be updated accordingly.
**Proposed Code changes**
@@ -172,7 +190,9 @@ Note: In case of increase in cluster resources, the scheduler will automatically
klog.ErrorS(err, "Error fetching machine info")
} else {
cachedMachineInfo, _ := kl.GetCachedMachineInfo()
-
+ // Avoid collector collects it as a timestamped metric
+ // See PR #95210 and #97006 for more details.
+ machineInfo.Timestamp = time.Time{}
if !reflect.DeepEqual(cachedMachineInfo, machineInfo) {
kl.setCachedMachineInfo(machineInfo)
@@ -204,8 +224,8 @@ Note: In case of increase in cluster resources, the scheduler will automatically
2. Adding a method Sync to all the resource managers and will be invoked once there is dynamic resource change. 
```go
- // Sync will sync the CPU Manager with the latest machine info
- Sync(machineInfo *cadvisorapi.MachineInfo) error
+ // SyncMachineInfo will sync the Manager with the latest machine info
+ SyncMachineInfo(machineInfo *cadvisorapi.MachineInfo) error
```
### Test Plan
@@ -228,7 +248,7 @@ Following scenarios need to be covered:
* Node resource information before and after resource hot plug. * State of Pending pods due to lack of resources after resource hot plug.
-* Resource manager states after the resynch of components.
+* Resource manager states after the resync of components.
### Graduation Criteria
@@ -269,6 +289,7 @@ Existing cluster does not have any impact as the node resources already been upd
It's always possible to trivially downgrade to the previous kubelet. It does not have any impact, as future node resource hot plugs won't be reflected in the cluster without a manual kubelet restart.
+
### Version Skew Strategy
+Hot unplug of resources is not supported, so any decrease in node resources will be automatically updated; however, pod
+re-admission is not done, so pods may be running with low resources until the kubelet is restarted.
## Alternatives
@@ -672,3 +695,10 @@ What other approaches did you consider, and why did you rule them out? These do not need to be as detailed as the proposal, but should include enough information to express the idea and why it was not acceptable. -->
+
+## Infrastructure Needed (Optional)
+The cluster's VMs should support hot plugging of compute resources for e2e tests.
+
+## Future Work
+
+* Support hot-unplug of node resources: hot-unplug requires pod re-admission; a separate KEP is planned to support this feature. 
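The `SyncMachineInfo` contract above can be illustrated with a toy manager that grows its view of allocatable CPUs when the core count increases. This is a sketch only: `machineInfo` and `toyCPUManager` are local stand-ins for `cadvisorapi.MachineInfo` and the real CPU manager, and the hot-unplug rejection mirrors the KEP's scale-up-only scope rather than actual kubelet behaviour.

```go
package main

import "fmt"

// machineInfo stands in for cadvisorapi.MachineInfo in this sketch.
type machineInfo struct{ NumCores int }

// toyCPUManager keeps a per-core allocation table that must track the
// machine's core count, mirroring the proposed SyncMachineInfo contract.
type toyCPUManager struct {
	allocatable []int // logical CPU IDs the manager may hand out
}

// SyncMachineInfo resyncs the manager with the latest machine info.
// Only growth is handled, matching the KEP's hot-plug-only (no unplug) scope.
func (m *toyCPUManager) SyncMachineInfo(mi machineInfo) error {
	if mi.NumCores < len(m.allocatable) {
		return fmt.Errorf("hot unplug not supported: %d < %d", mi.NumCores, len(m.allocatable))
	}
	for id := len(m.allocatable); id < mi.NumCores; id++ {
		m.allocatable = append(m.allocatable, id) // bring the new core into the pool
	}
	return nil
}

func main() {
	m := &toyCPUManager{allocatable: []int{0, 1, 2, 3}}
	if err := m.SyncMachineInfo(machineInfo{NumCores: 6}); err != nil {
		panic(err)
	}
	fmt.Println(len(m.allocatable)) // 6 allocatable CPUs after hot plugging two cores
}
```

A failed resync here surfaces as an error rather than silently corrupting the allocation table, which is the property the unit tests listed above would need to pin down.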
diff --git a/keps/sig-node/3953-node-resource-hot-plug/kep.yaml b/keps/sig-node/3953-node-resource-hot-plug/kep.yaml index 618799577cd..9cbf5d26c62 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/kep.yaml +++ b/keps/sig-node/3953-node-resource-hot-plug/kep.yaml @@ -14,8 +14,10 @@ reviewers: - "@ffromani" - "@SergeyKanzhelev" - "@haircommander" + - "@tallclair" approvers: - - "@sig-node-leads" + - "@haircommander" + - TBD see-also: replaces: From 4375027045d517a3f964b94a5ef733e9802681cb Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Wed, 5 Feb 2025 15:24:04 +0530 Subject: [PATCH 07/19] Address review comments --- .../3953-node-resource-hot-plug/README.md | 104 +++++++++++++----- .../3953-node-resource-hot-plug/kep.yaml | 3 +- 2 files changed, 76 insertions(+), 31 deletions(-) diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md index 1dc0f9b0969..7c7090413f8 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/README.md +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -20,6 +20,7 @@ tags, and then generate with `hack/update-toc.sh`. - [User Stories](#user-stories) - [Story 1](#story-1) - [Story 2](#story-2) + - [Story 3](#story-3) - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) @@ -80,7 +81,13 @@ The proposal seeks to facilitate hot plugging of node compute resources, thereby The revised node configurations will be automatically propagated at both the node and cluster levels, eliminating the necessity for a kubelet reset. Furthermore, this proposal intends to enhance the initialization and reinitialization processes of resource managers, including the CPU manager and memory manager, in response to alterations in a node's CPU and memory configurations. 
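The kernel-level hot plug that this summary relies on is exposed through sysfs on Linux. A sketch of how a resize surfaces on the node (paths per the kernel CPU/memory hotplug documentation; the write operations are shown only as comments because they require root and real hot-pluggable hardware):

```shell
# Read-only: which logical CPUs the kernel currently has online.
online="$(cat /sys/devices/system/cpu/online)"
echo "online CPUs: ${online}"

# After the hypervisor hot plugs a vCPU, it typically appears offline until enabled:
#   echo 1 > /sys/devices/system/cpu/cpu4/online        # requires root
# Similarly for a hot-plugged memory block (N is the block index):
#   echo online > /sys/devices/system/memory/memoryN/online
```

Once the kernel reports the new capacity, cAdvisor's periodic machine-info refresh picks it up, which is the point where this KEP's kubelet polling takes over.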
This approach aims to optimize resource management, improve scalability, and minimize disruptions to cluster operations. + ## Motivation +Currently, the node's configurations are recorded solely during the kubelet bootstrap phase and subsequently cached. assuming the node's compute capacity remains unchanged throughout the cluster's lifecycle. + +However, contemporary kernel capabilities enable the dynamic addition of CPUs and memory to a node (References: https://docs.kernel.org/core-api/cpu_hotplug.html and https://docs.kernel.org/core-api/memory-hotplug.html). +This can result in Kubernetes being unaware of the node's altered compute capacities during a live-resize, causing the node to retain outdated information. +This can lead to inconsistencies or an imbalance in the cluster, affecting the optimal scheduling and deployment of workloads. In a conventional Kubernetes environment, the cluster resources might necessitate modification due to the following factors: - Inaccurate resource allocation during cluster initialization. @@ -90,13 +97,16 @@ To address these situations, we can: - Horizontally scale the cluster by incorporating additional compute nodes. - Vertically scale the cluster by augmenting node capacity. Currently, the method to capture node resizing within the cluster entails restarting the Kubelet. -These strategies enable the cluster to adapt to varying resource demands, ensuring optimal performance and efficient resource utilization. However, the limitation of requiring a Kubelet restart for node resizing is an area for potential improvement. +These strategies enable the cluster to adapt to varying resource demands, ensuring optimal performance and efficient resource utilization. +However, the limitation of requiring a Kubelet restart for node resizing is an area for potential improvement. 
+ +Node resource hot plugging proves advantageous in scenarios such as: +- Efficiently managing resource demands with a limited number of nodes by increasing the capacity of existing nodes instead of provisioning new ones. +- The procedure of establishing new nodes is considerably more time-intensive than expanding the capabilities of current nodes. + +Implementing this KEP will empower nodes to recognize and adapt to changes in their configurations, +thereby facilitating the efficient and effective deployment of pod workloads to nodes capable of meeting the required compute demands. -Node resource hot plugging offers benefits in situations like: -- Managing resource demand with a restricted number of nodes by enhancing the capacity of current nodes rather than creating new ones. -- The process of creating new nodes is more time-consuming compared to augmenting the capacity of existing nodes. - -This approach allows for more efficient resource management and quicker capacity adjustments, optimizing the utilization of existing hardware. ### Goals * Achieve seamless node capacity expansion through hot plugging resources, all without necessitating a kubelet restart. @@ -107,10 +117,11 @@ This approach allows for more efficient resource management and quicker capacity * Dynamically adjust system reserved and kube reserved values. * Hot unplug of node resources. * Update the autoscaler to utilize resource hot plugging. +* Re-balance workloads across the nodes. +* Update runtime/NRI plugins with host resource changes. ## Proposal - This KEP strives to enable node resource hot plugging by incorporating a polling mechanism within the kubelet to retrieve machine-information from cAdvisor's cache, which is already updated periodically. The kubelet will periodically fetch this information, subsequently entrusting the node status updater to disseminate these updates at the node level across the cluster. 
Moreover, this KEP aims to refine the initialization and reinitialization processes of resource managers, including the memory manager and CPU manager, to ensure their adaptability to changes in node configurations. @@ -119,24 +130,34 @@ Moreover, this KEP aims to refine the initialization and reinitialization proces #### Story 1 -As a cluster admin, I must be able to increase the cluster resource capacity without adding a new node to the cluster. +Pinning of workloads to nodes with certain hardware capabilities with limited CPU and memory configurations. + Adopting this KEP will allow nodes with certain hardware capabilities to be resized to accommodate additional workloads that are dependent on particular hardware capability. #### Story 2 +As a cluster admin, I must be able to increase the cluster resource capacity without adding a new node to the cluster. + +#### Story 3 + As a cluster admin, I must be able to increase the cluster resource capacity without need to restart the kubelet. ### Notes/Constraints/Caveats (Optional) ### Risks and Mitigations -1. Node resource hot plugging is an opt-in feature, merging the - feature related changes won't impact existing workloads. Moreover, the feature - will be rolled out gradually, beginning with an alpha release for testing and - gathering feedback. This will be followed by beta and GA releases as the - feature matures and potential problems and improvements are addressed. -2. Though the node resource is updated dynamically, the dynamic data is fetched from cAdvisor and its well integrated with kubelet. -3. Resource manager are updated to adapt to the dynamic node reconfigurations, Enough tests should be added to make sure its not affecting the existing functionalities. 
- +- Change in OOMScoreAdjust value: + - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity` + - However, with the change in memoryCapacity post up-scale, the OOMScoreAdjust of pods deployed post up-scale may not be in line with the + precalculated scores of pods which were deployed before. +- Change in Swap limit: + - The formula to calculate the swap limit is `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable` + - However, with the change in nodeTotalMemory and totalPodsSwapAvailable post up-scale, the swap limit of pods deployed post up-scale may not be in line with the + precalculated limits of pods which were deployed before. +- Post up-scale, any failure in the resync of resource managers may lead to incorrect or rejected allocation, which can lead to underperforming or rejected workloads. +- Lack of coordination about change in resource availability across kubelet/runtime/plugins. + +- To mitigate the risks, adequate tests should be added to avoid the scenarios where failure to resync resource managers can occur. +- The plugins/runtime should be updated to react to change in resource information on the node. ## Design Details @@ -168,15 +189,15 @@ With increase in cluster resources the following components will updated 1. Scheduler * Scheduler will automatically schedule any pending pods. +2. Change in OOM score adjust + * Currently, the OOM score adjust is calculated by + `1000 - (1000*containerMemReq)/memoryCapacity` + * So increase in memoryCapacity will result in updated OOM score adjust for pods deployed post resize. -2. Change in Swap Memory limit - * Currently, the swap memory limit is calculated as +3. Change in Swap Memory limit + * Currently, the swap memory limit is calculated by `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable` - * So increase in nodeTotalMemory will result in updated swap memory limit. - - -3. Change in OOM score - * OOM score calculation depends on machine's memory, so the new OOM score will be updated accordingly.
+ * So increase in nodeTotalMemory will result in updated swap memory limit for pods deployed post resize. **Proposed Code changes** @@ -246,7 +267,10 @@ to implement this enhancement. Following scenarios need to be covered: -* Node resource information before and after resource hot plug. +* Node resource information before and after resource hot plug for the following scenarios: + * upsize -> downsize + * upsize -> downsize -> upsize + * downsize -> upsize * State of Pending pods due to lack of resources after resource hot plug. * Resource manager states after the resync of components. @@ -411,7 +435,7 @@ Rollback failure should not affect running workloads. What signals should users be paying attention to when the feature is young that might indicate a serious problem? --> - +A significant increase in the `node_resize_resync_errors_total` metric indicates that the feature is not working as expected. Likewise, if pods remain pending after a resource hot plug and the `scheduler_pending_pods` metric does not decrease, the feature is not working as expected. @@ -440,6 +464,10 @@ For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field. --> +Monitor the following metrics: +- `node_resize_resync_request_total` +- `node_resize_resync_errors_total` + ###### How can an operator determine if the feature is in use by workloads? -No increase in the `scheduler_pending_pods` rate. + +For each node, the value of the metric `node_resize_resync_request_total` is expected to match the number of times the node has been resized. +For each node, the value of the metric `node_resize_resync_errors_total` is expected to be zero. + + ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- [X] Metrics - - Metric name: `scheduler_pending_pods` - - Components exposing the metric: scheduler + - Metric name: + - `node_resize_resync_request_total` + - `node_resize_resync_errors_total` + - Components exposing the metric: kubelet ###### Are there any missing metrics that would be useful to have to improve observability of this feature? @@ -500,7 +537,9 @@ Pick one more of these and delete the rest. Describe the metrics themselves and the reasons why they weren't added (e.g., cost, implementation difficulties, etc.). --> -No +- `node_resize_resync_request_total` +- `node_resize_resync_errors_total` + ### Dependencies - [Release Signoff Checklist](#release-signoff-checklist) +- [Glossary](#glossary) - [Summary](#summary) - [Motivation](#motivation) - [Goals](#goals) @@ -21,6 +22,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Story 1](#story-1) - [Story 2](#story-2) - [Story 3](#story-3) + - [Story 4](#story-4) - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) @@ -75,10 +77,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release* [kubernetes/kubernetes]: https://git.k8s.io/kubernetes [kubernetes/website]: https://git.k8s.io/website +## Glossary + +hotplug: dynamically add compute resources (CPU, memory) to the node, either via software (bringing offlined resources online) or via hardware (physical additions while the system is running) + +hotunplug: dynamically remove compute resources (CPU, memory) from the node, either via software (making resources go offline) or via hardware (physical removal while the system is running) + + ## Summary The proposal seeks to facilitate hot plugging of node compute resources, thereby streamlining cluster resource capacity updates through node compute resource resizing, rather than introducing new nodes to the cluster.
-The revised node configurations will be automatically propagated at both the node and cluster levels, eliminating the necessity for a kubelet reset. +The revised node configurations will be automatically propagated at both the node and cluster levels. Furthermore, this proposal intends to enhance the initialization and reinitialization processes of resource managers, including the CPU manager and memory manager, in response to alterations in a node's CPU and memory configurations. This approach aims to optimize resource management, improve scalability, and minimize disruptions to cluster operations. @@ -95,21 +104,23 @@ In a conventional Kubernetes environment, the cluster resources might necessitat To address these situations, we can: - Horizontally scale the cluster by incorporating additional compute nodes. -- Vertically scale the cluster by augmenting node capacity. Currently, the method to capture node resizing within the cluster entails restarting the Kubelet. +- Vertically scale the cluster by augmenting node capacity. As a workaround for this issue, the method to capture node resizing within the cluster entails restarting the Kubelet. -These strategies enable the cluster to adapt to varying resource demands, ensuring optimal performance and efficient resource utilization. -However, the limitation of requiring a Kubelet restart for node resizing is an area for potential improvement. +These strategies enable the cluster to adapt to varying resource demands, ensuring optimal performance and efficient resource utilization. +However, for vertical scaling, the current implementation does not allow the kubelet to be aware of the changes made to the compute capacity of the node. Node resource hot plugging proves advantageous in scenarios such as: - Efficiently managing resource demands with a limited number of nodes by increasing the capacity of existing nodes instead of provisioning new ones.
- The procedure of establishing new nodes is considerably more time-intensive than expanding the capabilities of current nodes. +- Reduced inter-pod network latencies, as inter-node traffic can be reduced when more pods can be hosted on a single node. +- Easier to manage the cluster with fewer nodes, which brings less overhead on the control plane. Implementing this KEP will empower nodes to recognize and adapt to changes in their configurations, thereby facilitating the efficient and effective deployment of pod workloads to nodes capable of meeting the required compute demands. ### Goals -* Achieve seamless node capacity expansion through hot plugging resources, all without necessitating a kubelet restart. +* Achieve seamless node capacity expansion through hot plugging resources. * Facilitate the reinitialization of resource managers (CPU manager, memory manager) to accommodate alterations in the node's resource allocation. ### Non-Goals @@ -130,16 +141,25 @@ Moreover, this KEP aims to refine the initialization and reinitialization proces #### Story 1 -Pinning of workloads to nodes with certain hardware capabilities with limited CPU and memory configurations. - Adopting this KEP will allow nodes with certain hardware capabilities to be resized to accommodate additional workloads that are dependent on particular hardware capability. +As a Kubernetes user, I want to resize nodes with existing specialized hardware (such as GPUs, FPGAs, TPUs, etc.) or CPU Capabilities (for example: https://www.kernel.org/doc/html/v5.8/arm64/elf_hwcaps.html) +to allocate more resources (CPU, memory) so that additional workloads, which depend on this hardware, can be efficiently scheduled and run without manual intervention. #### Story 2 -As a cluster admin, I must be able to increase the cluster resource capacity without adding a new node to the cluster.
+As a Kubernetes Application Developer, I want the kernel to optimize system performance by making better use of local resources when a node is resized, so that my applications run faster with fewer disruptions. This is achieved through: +Fewer context switches: With more CPU cores and memory on a resized node, the kernel can spread workloads out more efficiently. This reduces contention between processes, leading to fewer context switches (which can be costly in terms of CPU time), less process interference, and lower latency. +Better memory allocation: If the kernel has more memory available, it can allocate larger contiguous memory blocks, which can lead to better memory locality (i.e., keeping related data closer in physical memory), +reducing latency for applications that rely on large datasets, as in the case of database applications. #### Story 3 -As a cluster admin, I must be able to increase the cluster resource capacity without need to restart the kubelet. +As a Site Reliability Engineer (SRE), I want to reduce the operational complexity of managing multiple worker nodes, so that I can focus on fewer resources and simplify troubleshooting and monitoring. + +#### Story 4 + +As a Cluster administrator, I want to resize a Kubernetes node dynamically, so that I can quickly hot plug resources without waiting for new nodes to join the cluster. + ### Notes/Constraints/Caveats (Optional) @@ -310,8 +330,7 @@ Existing cluster does not have any impact as the node resources already been upd ##### Downgrade -It's always possible to trivially downgrade to the previous kubelet, It does not have any impact as the future node resource hot plug wont be reflected in cluster -without manual kubelet restart. +It's always possible to trivially downgrade to the previous kubelet. There is no impact, as future node resource hot plugs won't be reflected in the cluster.
### Version Skew Strategy @@ -392,7 +411,7 @@ node configuration the system will continue to work in the same way. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Yes. Once disabled any hot plug of resources won't reflect at the cluster level without kubelet restart. +Yes. Once disabled, any hot plug of resources won't reflect at the cluster level. ###### What happens if we reenable the feature if it was previously rolled back? @@ -424,7 +443,7 @@ rollout. Similarly, consider large clusters and how enablement/disablement will rollout across nodes. --> -Rollout may fail if the resource managers are not re-synced properly due to programatic errors. +Rollout may fail if the resource managers are not re-synced properly due to programmatic errors. In case of rollout failures, running workloads are not affected, If the pods are on pending state they remain in the pending state only. Rollback failure should not affect running workloads. @@ -700,9 +719,8 @@ Failure scenarios can occur in cAdvisor level that is if it wrongly updated with ###### What steps should be taken if SLOs are not being met to determine the problem? -If enabling this feature causes performance degradation, its suggested not to hot plug resources and restart the kubelet -to manually to continue operation as before. +If the SLOs are not being met, one can examine the kubelet logs; it is also advised not to hotplug node resources. ## Implementation History @@ -722,12 +740,13 @@ Major milestones might include: -Hot Unplug of resource is not supported so any decrease in node resources will be automatically updated but the Pods -re-admission is not done so Pods may be running with low resources until kubelet is restarted. + +Currently, this KEP only focuses on resource hotplug; however, in a case where the node is downsized, it is possible that the +node's capacity may be lower than the existing workloads' memory requirements.
## Alternatives -Existing and the alternative to this effort would be restarting the kubelet manually each time after the node resize. +Horizontally scale the cluster by incorporating additional compute nodes. Rollout may fail if the resource managers are not re-synced properly due to programmatic errors. -In case of rollout failures, running workloads are not affected, If the pods are on pending state they remain -in the pending state only. +In case of rollout failures, running workloads are not affected. If the pods are in pending state, they remain pending. Rollback failure should not affect running workloads. ###### What specific metrics should inform a rollback? @@ -915,7 +913,7 @@ VMs of cluster should support hot plug of compute resources for e2e tests. or if it has to be terminated due to resource crunch. * Recalculate OOM adjust score and Swap limits: * Since the total capacity of the node has changed, values associated with the nodes memory capacity must be recomputed. - * Handling unplug of reserved CPUs. + * Handling unplug of reserved and exclusively allocated cpus CPUs. * Fetching machine info via CRI * At present, the machine data is retrieved from cAdvisor's cache through periodic checks. There is ongoing development to utilize CRI APIs for this purpose.
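The "Recalculate OOM adjust score and Swap limits" item above can be made concrete with the two formulas this KEP quotes. A minimal sketch (units in MiB; the swap-limit variable names are the placeholders used elsewhere in this KEP, and the clamping the kubelet applies around these values is omitted):

```go
package main

import "fmt"

// oomScoreAdj follows the formula quoted in this KEP:
// 1000 - (1000*containerMemReq)/memoryCapacity, using integer arithmetic.
func oomScoreAdj(containerMemReq, memoryCapacity int64) int64 {
	return 1000 - (1000*containerMemReq)/memoryCapacity
}

// swapLimit follows (containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable,
// multiplying first so integer division does not truncate to zero.
func swapLimit(containerMemoryRequest, nodeTotalMemory, totalPodsSwapAvailable int64) int64 {
	return (containerMemoryRequest * totalPodsSwapAvailable) / nodeTotalMemory
}

func main() {
	const req = 2048 // MiB requested by a container
	// A pod admitted on an 8 GiB node, before and after memory is hot plugged to 16 GiB:
	fmt.Println(oomScoreAdj(req, 8192), oomScoreAdj(req, 16384))         // 750 875
	fmt.Println(swapLimit(req, 8192, 4096), swapLimit(req, 16384, 4096)) // 1024 512
	// The gap between the stale and recomputed values is what the
	// "recalculate" step has to reconcile for already-running pods.
}
```

Both formulas depend on the node's total memory, which is exactly why a capacity change leaves already-computed per-container values stale.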
From 9db91aec70d5fe9f9d82ce812baf2ab76375a04f Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Wed, 14 May 2025 12:13:19 +0530 Subject: [PATCH 16/19] Address review comments --- keps/sig-node/3953-node-resource-hot-plug/README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md index 0d186e69150..7c9b89f5aa5 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/README.md +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -88,10 +88,11 @@ Hotplug: Dynamically add compute resources (CPU, Memory, Swap Capacity and HugeP Hotunplug: Dynamically remove compute resources (CPU, Memory, Swap Capacity and HugePages) from the node, either via software (make resources go offline) or via hardware (physical removal while the system is running) +Node Compute Resource: CPU, Memory, Swap Capacity and HugePages ## Summary -The proposal seeks to facilitate hot plugging of node compute resources(CPU, Memory, Swap Capacity and HugePages), thereby streamlining cluster resource capacity updates through node compute resource resizing rather than introducing new nodes to the cluster. +The proposal seeks to facilitate hot plugging of node compute resources, thereby streamlining cluster resource capacity updates through node compute resource resizing rather than introducing new nodes to the cluster. The revised node configurations will be automatically propagated at both the node and cluster levels. Furthermore, this proposal intends to enhance the initialization and reinitialization processes of resource managers, including the CPU manager and memory manager, in response to alterations in a node's CPU and memory configurations and @@ -135,7 +136,7 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th ### Goals * Achieve seamless node capacity expansion through hot plugging resources.
-* Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager to accommodate alterations in the node's resource allocation. +* Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager without reset to accommodate alterations in the node's resource allocation. * Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods. ### Non-Goals @@ -913,7 +914,7 @@ VMs of cluster should support hot plug of compute resources for e2e tests. or if it has to be terminated due to resource crunch. * Recalculate OOM adjust score and Swap limits: * Since the total capacity of the node has changed, values associated with the nodes memory capacity must be recomputed. - * Handling unplug of reserved and exclusively allocated cpus CPUs. + * Handling unplug of reserved and exclusively allocated CPUs. * Fetching machine info via CRI * At present, the machine data is retrieved from cAdvisor's cache through periodic checks. There is ongoing development to utilize CRI APIs for this purpose. From 579af1b143c5b779bcffdd1b46a177b28ea23640 Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Fri, 16 May 2025 18:49:32 +0530 Subject: [PATCH 17/19] Add CA compatability section --- .../3953-node-resource-hot-plug/README.md | 24 +++++++++++++++---- 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md index 7c9b89f5aa5..07decb3d47d 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/README.md +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -29,6 +29,7 @@ tags, and then generate with `hack/update-toc.sh`. 
- [Design Details](#design-details) - [Handling hotplug events](#handling-hotplug-events) - [Flow Control for updating swap limit for containers](#flow-control-for-updating-swap-limit-for-containers) + - [Compatability with Cluster Autoscaler](#compatability-with-cluster-autoscaler) - [Handling HotUnplug Events](#handling-hotunplug-events) - [Flow Control](#flow-control) - [Test Plan](#test-plan) @@ -135,7 +136,7 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th ### Goals -* Achieve seamless node capacity expansion through hot plugging resources. +* Achieve seamless node capacity expansion through resource hotplug. * Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager without reset to accommodate alterations in the node's resource allocation. * Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods. @@ -143,7 +144,7 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th * Dynamically adjust system reserved and kube reserved values. * Hot unplug of node resources. -* Update the autoscaler to utilize resource hot plugging. +* Update the autoscaler to utilize resource hotplug. * Re-balance workloads across the nodes. * Update runtime/NRI plugins with host resource changes. @@ -278,9 +279,9 @@ With increase in cluster resources the following components will be updated: Once the capacity of the node is altered, the following are the sequence of events that occur in the kubelet. If any errors are observed in any of the steps, operation is retried from step 1 along with a `FailedNodeResize` event under the node object. 1. Resizing existing containers: - a.With the increased memory capacity of the nodes, the kubelet proceeds to update fields that are directly related to + a. With the increased memory capacity of the nodes, the kubelet proceeds to update fields that are directly related to the available memory on the host. 
This would lead to recalculation of oom_score_adj and swap_limits. - b.This is achieved by invoking the CRI API - UpdateContainerResources. + b. This is achieved by invoking the CRI API - UpdateContainerResources. 2. Reinitialise Resource Manager: a. Resource managers such as CPU,Memory are updated with the latest available capacities on the host. This posts the latest @@ -318,6 +319,21 @@ T=1: Resize Instance to Hotplug Memory: Similar flow is applicable for updating oom_score_adj. +#### Compatability with Cluster Autoscaler + +The Cluster Autoscaler (CA) presently anticipates uniform allocatable values among nodes within the same NodeGroup, using existing Nodes as templates for +newly provisioned Nodes from the same NodeGroup. However, with the introduction of NodeResourceHotplug, this assumption may no longer hold true. +If not appropriately addressed, this could cause the Cluster Autoscaler to randomly select a Node from the group and assume identical allocatable values for all upcoming Nodes. +This could lead to suboptimal decisions, such as repeatedly attempting to provision Nodes for pending Pods that are incompatible, or overlooking potential Nodes that could accommodate such Pods. + +To ensure the Cluster Autoscaler acknowledges resource hotplug, the following approaches have been proposed by the Cluster Autoscaler team: +1. Capture Node's Initial Allocatable Values: + * Introduce a new field within the Node object to record initial node allocatable values, which remain unchanged during resource hotplug. + * The Cluster Autoscaler can leverage this field to anticipate potential hotplug of resources, using it as a template for configuring new Nodes. + +2. Identify Nodes Affected by Hotplug: + * By flagging a Node as being impacted by hotplug, the Cluster Autoscaler could revert to a less reliable but more conservative "scale from 0 nodes" logic. 
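The first approach above can be sketched as follows. `InitialAllocatable` is a hypothetical field used only to illustrate the template-selection idea — it is not an agreed API change:

```go
package main

import "fmt"

// Node is a trimmed stand-in for the v1 Node object. InitialAllocatable is
// the hypothetical "initial allocatable values" field from approach 1.
type Node struct {
	Name               string
	Allocatable        map[string]int64 // current values; may reflect hotplug
	InitialAllocatable map[string]int64 // recorded at registration; never resized
}

// templateAllocatable picks what the autoscaler should assume a *new* node in
// the group will provide: the pre-hotplug values, when they are recorded.
func templateAllocatable(n Node) map[string]int64 {
	if n.InitialAllocatable != nil {
		return n.InitialAllocatable
	}
	return n.Allocatable // fallback: today's behaviour
}

func main() {
	n := Node{
		Name:               "worker-0",
		Allocatable:        map[string]int64{"cpu": 8, "memoryGiB": 32}, // after hotplug
		InitialAllocatable: map[string]int64{"cpu": 4, "memoryGiB": 16}, // as provisioned
	}
	fmt.Println(templateAllocatable(n)["cpu"]) // 4, not the hotplugged 8
}
```

This keeps scale-up estimates based on what a freshly provisioned node would actually offer, rather than on a node that has since been resized.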
+ ### Handling HotUnplug Events Though this KEP focuses only on resource hotplug, it will enable the kubelet to capture the current available capacity of the node (regardless of whether it was a hotplug or a hotunplug of resources). From 29835a8806dff90107781bf4c082b349011ff3d9 Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Wed, 21 May 2025 16:39:17 +0530 Subject: [PATCH 18/19] Update OOMScoreAdj formula --- .../3953-node-resource-hot-plug/README.md | 49 +++++++++---------- 1 file changed, 24 insertions(+), 25 deletions(-) diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md index 07decb3d47d..289c2eada03 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/README.md +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -29,7 +29,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Design Details](#design-details) - [Handling hotplug events](#handling-hotplug-events) - [Flow Control for updating swap limit for containers](#flow-control-for-updating-swap-limit-for-containers) - - [Compatability with Cluster Autoscaler](#compatability-with-cluster-autoscaler) + - [Compatibility with Cluster Autoscaler](#compatibility-with-cluster-autoscaler) - [Handling HotUnplug Events](#handling-hotunplug-events) - [Flow Control](#flow-control) - [Test Plan](#test-plan) @@ -138,7 +138,7 @@ Implementing this KEP will empower nodes to recognize and adapt to changes in th * Achieve seamless node capacity expansion through resource hotplug. * Enable the re-initialization of resource managers (CPU manager, memory manager) and kube runtime manager without reset to accommodate alterations in the node's resource allocation. -* Recalculating and updating the OOMScoreAdj and swap memory limit for existing pods. +* Recalculating and updating swap memory limit for existing pods.
### Non-Goals @@ -187,12 +187,6 @@ detect the change in compute capacity, which can bring in additional complicatio ### Risks and Mitigations -- Change in OOMScoreAdjust value: - - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity` - - With change in memoryCapacity post up-scale, The existing OOMScoreAdjust may not be inline with the - actual OOMScoreAdjust for existing pods. - - This can be mitigated by recalculating the OOMScoreAdjust value for the existing pods. However, there can be an associated overhead for - recalculating the scores. - Change in Swap limit: - The formula to calculate the swap limit is `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable` - With change in nodeTotalMemory and totalPodsSwapAvailable post up-scale, The existing swap limit may not be inline with the @@ -200,6 +194,17 @@ detect the change in compute capacity, which can bring in additional complicatio - This can be mitigated by recalculating the swap limit for the existing pods. However, there can be an associated overhead for recalculating the scores. +- Change in OOMScoreAdjust value: + - The formula to calculate OOMScoreAdjust is `1000 - (1000*containerMemReq)/memoryCapacity` + - With the change in memoryCapacity post up-scale, the existing OOMScoreAdjust may not be in line with the + actual OOMScoreAdjust for existing pods. + - It's not recommended to update the OOMScoreAdjust of a running container, as the OOMScoreAdjust value is set for the init process (pid 1), which is + responsible for running all the other processes in the container. + - When we update OOMScoreAdjust for a running container, it is set only for the container's init process (and possibly for processes started later, which inherit it); already + running processes won't get the new OOMScoreAdjust value.
- This can be mitigated by updating the OOMScoreAdj formula to not consider the current memory value; hence the new OOMScoreAdj formula looks like this: + `min(0, 1000 - (1000*containerMemoryRequest)/initialMemoryCapacity)` + - Post up-scale, any failure in the resync of resource managers may lead to incorrect or rejected allocation, which can lead to underperforming or rejected workloads. - To mitigate the risks, adequate tests should be added to avoid the scenarios where failure to resync resource managers can occur. @@ -235,7 +240,7 @@ sequenceDiagram machine-info->>cAdvisor-cache: update cAdvisor-cache->>kubelet: update alt if increase in resource - kubelet->>node: recalculate and update OOMScoreAdj
and Swap limit of containers + kubelet->>node: recalculate and update Swap limit of containers kubelet->>node: re-initialize resource managers kubelet->>node: node status update with new capacity else if decrease in resource @@ -246,7 +251,7 @@ The interaction sequence is as follows: 1. Kubelet will fetch machine resource information from cAdvisor's cache, which is configurable via the cAdvisor flag `update_machine_info_interval`. 2. If the machine resource is increased: - * Recalculate, update OOMScoreAdj and Swap limit of all the running containers. + * Recalculate and update the Swap limit of all the running containers. * Re-initialize resource managers. * Update node with new resource. 3. If the machine resource is decreased: @@ -254,21 +259,16 @@ The interaction sequence is as follows: in case there was no history of hotplug.) With increase in cluster resources the following components will be updated: -1. Change in OOM score adjust: - * Currently, the OOM score adjust is calculated by - `1000 - (1000*containerMemReq)/memoryCapacity` - * Increase in memoryCapacity will result in updated OOM score adjust for pods deployed post resize and also recalculate the same for existing pods. +1. Change in Swap Memory limit: + * Currently, the swap memory limit is calculated by + `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable` + * Increase in nodeTotalMemory or totalPodsSwapAvailable will result in updated swap memory limit for pods deployed post resize and also recalculate the same for existing pods. -2. Change in Swap Memory limit: - * Currently, the swap memory limit is calculated by - `(containerMemoryRequest/nodeTotalMemory)*totalPodsSwapAvailable` - * Increase in nodeTotalMemory or totalPodsSwapAvailable will result in updated swap memory limit for pods deployed post resize and also recalculate the same for existing pods. +2. Resource managers are re-initialised. -3. Resource managers are re-initialised. +3. Update in Node capacity. -4. Update in Node capacity. -5. Scheduler: +4. Scheduler: * Scheduler will automatically schedule any pending pods.
* This is done as an expected behavior and does not require any changes in the existing design of the scheduler, as the scheduler `watches` the available capacity of the node and creates pods accordingly. @@ -287,6 +287,7 @@ observed in any of the steps, operation is retried from step 1 along with a `Fai a. Resource managers such as CPU,Memory are updated with the latest available capacities on the host. This posts the latest available capacities under the node. b. This is achieved by calling ResyncComponents() of ContainerManager interface to re-sync the resource managers. + 3. Updating the node allocatable resources: a. As the scheduler keeps a tab on the available resources of the node, post updating the available capacities, the scheduler proceeds to schedule any pending pods. @@ -317,9 +318,7 @@ T=1: Resize Instance to Hotplug Memory: - /memory.swap.max: 1G ``` -Similar flow is applicable for updating oom_score_adj. - -#### Compatability with Cluster Autoscaler +#### Compatibility with Cluster Autoscaler The Cluster Autoscaler (CA) presently anticipates uniform allocatable values among nodes within the same NodeGroup, using existing Nodes as templates for newly provisioned Nodes from the same NodeGroup. However, with the introduction of NodeResourceHotplug, this assumption may no longer hold true. 
From 13467acfd96bca05d8d5bc342bd352789314d398 Mon Sep 17 00:00:00 2001 From: Karthik Bhat Date: Fri, 30 May 2025 14:53:59 +0530 Subject: [PATCH 19/19] Address reveiw comments Co-authored-by: kishen-v --- keps/sig-node/3953-node-resource-hot-plug/README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/3953-node-resource-hot-plug/README.md b/keps/sig-node/3953-node-resource-hot-plug/README.md index 289c2eada03..c6889688ee7 100644 --- a/keps/sig-node/3953-node-resource-hot-plug/README.md +++ b/keps/sig-node/3953-node-resource-hot-plug/README.md @@ -269,9 +269,9 @@ With increase in cluster resources the following components will be updated: 3. Update in Node capacity. 4. Scheduler: - * Scheduler will automatically schedule any pending pods. - * This is done as an expected behavior and does not require any changes in the existing design of the scheduler, as the scheduler `watches` the - available capacity of the node and creates pods accordingly. + * Scheduler keeps trying to schedule any pending pods. + * The scheduler `watches` the updates to the available capacity of the node and schedules pods accordingly. + The scheduler is already doing this today, and this KEP does not require any changes in the scheduler implementation. @@ -333,6 +333,9 @@ To ensure the Cluster Autoscaler acknowledges resource hotplug, the following ap 2. Identify Nodes Affected by Hotplug: * By flagging a Node as being impacted by hotplug, the Cluster Autoscaler could revert to a less reliable but more conservative "scale from 0 nodes" logic. +Given that this KEP and the autoscaler are inter-related, the above approaches were discussed in the community with relevant stakeholders, and it was decided to approach this problem through the former route.
The same will be targeted around the beta graduation of this KEP. + ### Handling HotUnplug Events Though this KEP focuses only on resource hotplug, it will enable the kubelet to capture the current available capacity of the node (regardless of whether it was a hotplug or a hotunplug of resources).