From ae38e21049b82f5da6a84553c0675c3b37fea873 Mon Sep 17 00:00:00 2001
From: Connor Doyle
Date: Mon, 22 May 2017 22:51:45 -0700
Subject: [PATCH 01/14] Added initial draft of CPU manager proposal.

---
 contributors/design-proposals/cpu-manager.md | 343 +++++++++++++++++++
 1 file changed, 343 insertions(+)
 create mode 100644 contributors/design-proposals/cpu-manager.md

diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md
new file mode 100644
index 00000000000..1cef1a05d15
--- /dev/null
+++ b/contributors/design-proposals/cpu-manager.md
@@ -0,0 +1,343 @@
+# CPU Manager
+
+_Authors:_
+
+* @ConnorDoyle - Connor Doyle <connor.p.doyle@intel.com>
+* @flyingcougar - Szymon Scharmach <szymon.scharmach@intel.com>
+* @sjenning - Seth Jennings <sjenning@redhat.com>
+
+**Contents:**
+
+* [Overview](#overview)
+* [Proposed changes](#proposed-changes)
+* [Operations and observability](#operations-and-observability)
+* [Practical challenges](#practical-challenges)
+* [Implementation roadmap](#implementation-roadmap)
+* [Appendix A: cpuset pitfalls](#appendix-a-cpuset-pitfalls)
+
+## Overview
+
+_Problems to solve:_
+
+1. Poor or unpredictable performance observed compared to virtual machine
+   based orchestration systems. Application latency is higher and CPU
+   throughput lower than in VMs because cpu quota is fulfilled across all
+   cores rather than on exclusive cores, which would allow fewer context
+   switches and higher cache affinity.
+1. Unacceptable latency attributed to the OS process scheduler, especially
+   for “fast” virtual network functions (want to approach line rate on
+   modern server NICs.)
+
+_Solution requirements:_
+
+1. Provide an API-driven contract from the system to a user: "if you are a
+   Guaranteed pod with 1 or more cores of cpu, the system will try to make
+   sure that the pod gets its cpu quota primarily from reserved core(s),
+   resulting in fewer context switches and higher cache affinity".
+1. Support the case where in a given pod, one container is latency-critical
+   and another is not (e.g. auxiliary side-car containers responsible for
+   log forwarding, metrics collection and the like.)
+1. Do not cap CPU quota for guaranteed containers that are granted
+   exclusive cores, since that would be antithetical to (1) above.
+1. Take physical processor topology into account in the CPU affinity policy.
+
+### Related issues
+
+* Feature: [Further differentiate performance characteristics associated
+  with pod level QoS](https://github.com/kubernetes/features/issues/276)
+
+## Proposed changes
+
+### CPU Manager component
+
+The *CPU Manager* is a new software component in Kubelet responsible for
+assigning pod containers to sets of CPUs on the local node. In the
+future, it may be expanded to control shared processor resources like
+caches.
+
+The CPU manager interacts directly with the kuberuntime. The CPU Manager
+is notified when containers come and go, before delegating container
+creation via the container runtime interface and after the container's
+destruction respectively. The CPU Manager emits CPU settings for
+containers in response.
+
+#### Discovering CPU topology
+
+The CPU Manager must understand basic topology. First of all, it must
+determine the number of logical CPUs (hardware threads) available for
+allocation. On architectures that support [hyper-threading][ht], sibling
+threads share a number of hardware resources including the cache
+hierarchy. On multi-socket systems, logical CPUs co-resident on a socket
+share L3 cache.
Although there may be some programs that benefit from +disjoint caches, the policies described in this proposal assume cache +affinity will yield better application and overall system performance for +most cases. In all scenarios described below, we prefer to acquire logical +CPUs topologically. For example, allocating two CPUs on a system that has +hyper-threading turned on yields both sibling threads on the same +physical core. Likewise, allocating two CPUs on a non-hyper-threaded +system yields two cores on the same socket. + +##### Options for discovering topology + +1. Read and parse the virtual file [`/proc/cpuinfo`][procfs] and construct a + convenient data structure. +1. Execute a simple program like `lscpu -p` in a subprocess and construct a + convenient data structure based on the output. Here is an example of + [data structure to represent CPU topology][topo] in go. The linked package + contains code to build a ThreadSet from the output of `lscpu -p`. +1. Execute a mature external topology program like [`mpi-hwloc`][hwloc] -- + potentially adding support for the hwloc file format to the Kubelet. + +#### CPU Manager interfaces (sketch) + +```go +type CPUManagerPolicy interface { + Init(driver CPUDriver, topo CPUTopo) + Add(c v1.Container, qos QoS) error + Remove(c v1.Container, qos QoS) error +} + +type CPUDriver { + GetPods() []v1.Pod + GetCPUs(containerID string) CPUList + SetCPUs(containerID string, clist CPUList) error + // Future: RDT L3 and L2 cache masks, etc. +} + +type CPUTopo TBD + +type CPUList string + +func (c CPUList) Size() int {} + +// Returns a CPU list with size n and the remainder or +// an error if the request cannot be satisfied, taking +// into account the supplied topology. +// +// @post: c = set_union(taken, remaining), +// empty_set = set_intersection(taken, remainder) +func (c CPUList) Take(n int, topo CPUTopo) (taken CPUList, + remainder CPUList, + err error) {} + +// Returns a CPU list that includes all CPUs in c and d and no others. +// +// @post: result = set_union(c, d) +func (c CPUList) Add(d CPUList) (result CPUList) {} +``` + +Kubernetes will ship with three CPU manager policies. Only one policy is +active at a time on a given node, chosen by the operator via Kubelet +configuration. The three policies are **no-op**, **static** and **dynamic**. +Each policy is described below. + +#### Policy 1: "no-op" cpuset control [default] + +This policy preserves the existing Kubelet behavior of doing nothing +with the cgroup `cpuset.cpus` and `cpuset.mems` controls. This “no-op” +policy would become the default CPU Manager policy until the effects of +the other policies are better understood. + +#### Policy 2: "static" cpuset control + +The "static" policy allocates exclusive CPUs for containers if they are +included in a pod of "Guaranteed" [QoS class][qos] and the container's +resource limit for the CPU resource is an integer greater than or +equal to one. + +When exclusive CPUs are allocated for a container, those CPUs are +removed from the allowed CPUs of every other container running on the +node. Once allocated at pod admission time, an exclusive CPU remains +assigned to a single container for the lifetime of the pod (until it +becomes terminal.) + +##### Implementation sketch + +```go +// Implements CPUManagerPolicy +type staticManager struct { + driver CPUDriver + topo CPUTopo + // CPU list assigned to non-exclusive containers. 
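+  // Invariant: shared is always disjoint from every exclusively
+  // allocated CPU list handed out by this policy.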
+ shared CPUList +} + +func (m *staticManager) Init(driver CPUDriver, topo CPUTopo) { + m.driver = driver + m.topo = topo +} + +func (m *staticManager) Add(c v1.Container, qos QoS) error { + if p.QoS == GUARANTEED && numExclusive(c) > 0 { + excl, err := allocate(numExclusive(c)) + if err != nil { + return err + } + m.driver.SetCPUs(c.ID, excl) + return nil + } + + // Default case: assign the shared set. + m.driver.SetCPUs(c.ID, m.shared) + return nil +} + +func (m *staticManager) Remove(c v1.Container, qos QoS) error { + m.free(m.driver.GetCPUs(c.ID)) +} + +func (m *staticManager) allocate(n int) (CPUList, err) { + excl, remaining, err := m.shared.Take(n, m.topo) + if err != nil { + return "", err + } + m.setShared(remaining) + return excl, nil +} + +func (m *staticManager) free(c CPUList) { + m.setShared(m.shared.add(c)) +} + +func (m *staticManager) setShared(c CPUList) { + prev := m.shared + m.shared = c + for _, pod := range m.driver.GetPods() { + for _, container := range p.Containers { + if driver.GetCPUs(container.ID) == prev { + driver.SetCPUs(m.shared) + } + } + } +} + +// @pre: container_qos = guaranteed +func numExclusive(c v1.Container) int { + if c.resources.requests["cpu"] % 1000 == 0 { + return c.resources.requests["cpu"] / 1000 + } + return 0 +} +``` + +##### Example pod specs and interpretation + +| Pod | Interpretation | +| ------------------------------------------ | ------------------------------ | +| Pod [Guaranteed]:
 A:
  cpu: 0.5 | Container **A** is assigned to the shared cpuset. | +| Pod [Guaranteed]:
 A:
  cpu: 2.0 | Container **A** is assigned two sibling threads on the same physical core (HT) or two physical cores on the same socket (no HT.)

The shared cpuset is shrunk to make room for the exclusively allocated CPUs. | +| Pod [Guaranteed]:
 A:
  cpu: 1.0
 A:
  cpu: 0.5 | Container **A** is assigned one exclusive CPU and container **B** is assigned to the shared cpuset. | +| Pod [Guaranteed]:
 A:
  cpu: 1.5
 A:
  cpu: 0.5 | Both containers **A** and **B** are assigned to the shared cpuset. | +| Pod [Burstable] | All containers are assigned to the shared cpuset. | +| Pod [BestEffort] | All containers are assigned to the shared cpuset. | + +#### Policy 3: "dynamic" cpuset control + +_TODO: Describe the policy._ + +##### Implementation sketch + +```go +// Implements CPUManagerPolicy. +type dynamicManager struct {} + +func (m *dynamicManager) Init(driver CPUDriver, topo CPUTopo) { + // TODO +} + +func (m *dynamicManager) Add(c v1.Container, qos QoS) error { + // TODO +} + +func (m *dynamicManager) Remove(c v1.Container, qos QoS) error { + // TODO +} +``` + +##### Example pod specs and interpretation + +| Pod | Interpretation | +| ------------------------------------------ | ------------------------------ | +| | | +| | | + +## Operations and observability + +* Checkpointing assignments + * The CPU Manager must be able to pick up where it left off in case the + Kubelet restarts for any reason. +* Read effective CPU assinments at runtime for alerting. This could be + satisfied by the checkpointing requirement. +* Configuration + * How does the CPU Manager coexist with existing kube-reserved + settings? + * How does the CPU Manager coexist with related Linux kernel + configuration (e.g. `isolcpus`.) The operator may want to specify a + low-water-mark for the size of the shared cpuset. The operator may + want to correlate exclusive cores with the isolated CPUs, in which + case the strategy outlined above where allocations are taken + directly from the shared pool is too simplistic. We could allow an + explicit pool of cores that may be exclusively allocated and default + this to the shared pool (leaving at least one core fro the shared + cpuset to be used for OS, infra and non-exclusive containers. + +## Practical challenges + +1. Synchronizing CPU Manager state with the container runtime via the + CRI. Runc/libcontainer allows container cgroup settings to be updtaed + after creation, but neither the Kubelet docker shim nor the CRI + implement a similar interface. + 1. Mitigation: [PR 46105](https://github.com/kubernetes/kubernetes/pull/46105) + +## Implementation roadmap + +### Phase 1 + +* Internal API exists to allocate CPUs to containers + ([PR 46105](https://github.com/kubernetes/kubernetes/pull/46105)) +* Kubelet configuration includes a CPU manager policy (initially only no-op) +* No-op policy is implemented. +* All existing unit and e2e tests pass. +* Initial unit tests pass. + +### Phase 2 + +* Kubelet can discover "basic" CPU topology (HT-to-physical-core map) +* Static policy is implemented. +* Unit tests for static policy pass. +* e2e tests for static policy pass. +* Performance metrics for one or more plausible synthetic workloads show + benefit over no-op policy. + +### Phase 3 + +* Dynamic policy is implemented. +* Unit tests for dynamic policy pass. +* e2e tests for dynamic policy pass. +* Performance metrics for one or more plausible synthetic workloads show + benefit over no-op policy. + +### Phase 4 + +* Kubelet can discover "advanced" CPU topology (NUMA). + +## Appendix A: cpuset pitfalls + +1. `cpuset.sched_relax_domain_level` +1. Child cpusets must be subsets of their parents. If B is a child of A, + then B must be a subset of A. Attempting to shrink A such that B + would contain allowed CPUs not in A is not allowed (the write will + fail.) Nested cpusets must be shrunk bottom-up. By the same rationale, + nested cpusets must be expanded top-down. +1. 
Dynamically changing cpusets by directly writing to the sysfs would + create inconsistencies with container runtimes. +1. The `exclusive` flag. This will not be used. We will achieve + exclusivity for a CPU by removing it from all other assigned cpusets. +1. Tricky semantics when cpusets are combined with CFS shares and quota. + +[ht]: http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html +[hwloc]: https://www.open-mpi.org/projects/hwloc +[procfs]: http://man7.org/linux/man-pages/man5/proc.5.html +[qos]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md +[topo]: +http://github.com/intelsdi-x/swan/tree/master/pkg/isolation/topo From a21b6adf7382ccf88a0ca42b133599af39eb4dd0 Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Mon, 5 Jun 2017 15:34:15 -0700 Subject: [PATCH 02/14] Added cache alloation implementation phase. --- contributors/design-proposals/cpu-manager.md | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index 1cef1a05d15..72e7058ce3e 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -51,9 +51,9 @@ _Solution requirements:_ ### CPU Manager component The *CPU Manager* is a new software component in Kubelet responsible for -assigning pod containers to sets of CPUs on the local node. In the -future, it may be expanded to control shared processor resources like -caches. +assigning pod containers to sets of CPUs on the local node. In later +phases, the scope will expand to include caches, a critical shared +processor resource. The CPU manager interacts directly with the kuberuntime. The CPU Manager is notified when containers come and go, before delegating container @@ -291,7 +291,7 @@ func (m *dynamicManager) Remove(c v1.Container, qos QoS) error { ## Implementation roadmap -### Phase 1 +### Phase 1: No-op policy * Internal API exists to allocate CPUs to containers ([PR 46105](https://github.com/kubernetes/kubernetes/pull/46105)) @@ -300,7 +300,7 @@ func (m *dynamicManager) Remove(c v1.Container, qos QoS) error { * All existing unit and e2e tests pass. * Initial unit tests pass. -### Phase 2 +### Phase 2: Static policy * Kubelet can discover "basic" CPU topology (HT-to-physical-core map) * Static policy is implemented. @@ -309,7 +309,11 @@ func (m *dynamicManager) Remove(c v1.Container, qos QoS) error { * Performance metrics for one or more plausible synthetic workloads show benefit over no-op policy. -### Phase 3 +### Phase 3: Cache allocation + +* Static policy also manages [cache allocation][cat] on supported platforms. + +### Phase 4: Dynamic polidy * Dynamic policy is implemented. * Unit tests for dynamic policy pass. @@ -317,7 +321,7 @@ func (m *dynamicManager) Remove(c v1.Container, qos QoS) error { * Performance metrics for one or more plausible synthetic workloads show benefit over no-op policy. -### Phase 4 +### Phase 5: NUMA * Kubelet can discover "advanced" CPU topology (NUMA). @@ -335,6 +339,7 @@ func (m *dynamicManager) Remove(c v1.Container, qos QoS) error { exclusivity for a CPU by removing it from all other assigned cpusets. 1. Tricky semantics when cpusets are combined with CFS shares and quota. 
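To make the parent/child subset rule above concrete, here is a minimal sketch of
the required write ordering, assuming a cgroup v1 cpuset hierarchy mounted at
`/sys/fs/cgroup/cpuset`; the package, helper names and group paths are
illustrative only:

```go
package cpusetorder

import (
	"os"
	"path/filepath"
)

const cpusetRoot = "/sys/fs/cgroup/cpuset" // illustrative mount point

// writeCpus overwrites the cpuset.cpus file for the named cgroup.
func writeCpus(group, cpus string) error {
	path := filepath.Join(cpusetRoot, group, "cpuset.cpus")
	return os.WriteFile(path, []byte(cpus), 0644)
}

// Shrink removes CPUs from a parent and its child. The child must be
// written first: the kernel rejects a parent update that would leave
// the child with allowed CPUs outside its parent's set.
func Shrink(parent, child, parentCpus, childCpus string) error {
	if err := writeCpus(child, childCpus); err != nil {
		return err
	}
	return writeCpus(parent, parentCpus)
}

// Expand adds CPUs to a parent and its child. The parent must be
// written first so the child's new set remains a subset of its parent.
func Expand(parent, child, parentCpus, childCpus string) error {
	if err := writeCpus(parent, parentCpus); err != nil {
		return err
	}
	return writeCpus(child, childCpus)
}
```

The same ordering applies recursively to deeper hierarchies, which is why
nested cpusets must be shrunk bottom-up and expanded top-down.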
+[cat]: http://www.intel.com/content/www/us/en/communications/cache-monitoring-cache-allocation-technologies.html [ht]: http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html [hwloc]: https://www.open-mpi.org/projects/hwloc [procfs]: http://man7.org/linux/man-pages/man5/proc.5.html From b961e4eee9a540fa87090e1d95df44e23c4ad526 Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Tue, 11 Jul 2017 08:16:51 -0700 Subject: [PATCH 03/14] Fixed typos. --- contributors/design-proposals/cpu-manager.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index 72e7058ce3e..b51e1aa218f 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -226,8 +226,8 @@ func numExclusive(c v1.Container) int { | ------------------------------------------ | ------------------------------ | | Pod [Guaranteed]:
 A:
  cpu: 0.5 | Container **A** is assigned to the shared cpuset. | | Pod [Guaranteed]:
 A:
  cpu: 2.0 | Container **A** is assigned two sibling threads on the same physical core (HT) or two physical cores on the same socket (no HT.)

The shared cpuset is shrunk to make room for the exclusively allocated CPUs. | -| Pod [Guaranteed]:
 A:
  cpu: 1.0
 A:
  cpu: 0.5 | Container **A** is assigned one exclusive CPU and container **B** is assigned to the shared cpuset. | -| Pod [Guaranteed]:
 A:
  cpu: 1.5
 A:
  cpu: 0.5 | Both containers **A** and **B** are assigned to the shared cpuset. | +| Pod [Guaranteed]:
 A:
  cpu: 1.0
 B:
  cpu: 0.5 | Container **A** is assigned one exclusive CPU and container **B** is assigned to the shared cpuset. | +| Pod [Guaranteed]:
 A:
  cpu: 1.5
 B:
  cpu: 0.5 | Both containers **A** and **B** are assigned to the shared cpuset. | | Pod [Burstable] | All containers are assigned to the shared cpuset. | | Pod [BestEffort] | All containers are assigned to the shared cpuset. | @@ -313,7 +313,7 @@ func (m *dynamicManager) Remove(c v1.Container, qos QoS) error { * Static policy also manages [cache allocation][cat] on supported platforms. -### Phase 4: Dynamic polidy +### Phase 4: Dynamic policy * Dynamic policy is implemented. * Unit tests for dynamic policy pass. From a6afac8372b100d1bea5067584a0d3c87c91c147 Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Wed, 12 Jul 2017 10:54:31 -0700 Subject: [PATCH 04/14] Updated code snippets to match PoC branch. --- contributors/design-proposals/cpu-manager.md | 138 +++++++------------ 1 file changed, 49 insertions(+), 89 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index b51e1aa218f..3b2f53fd319 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -87,43 +87,39 @@ system yields two cores on the same socket. contains code to build a ThreadSet from the output of `lscpu -p`. 1. Execute a mature external topology program like [`mpi-hwloc`][hwloc] -- potentially adding support for the hwloc file format to the Kubelet. +1. Re-use existing discovery functionality from cAdvisor. **(preferred initial + solution)** #### CPU Manager interfaces (sketch) ```go -type CPUManagerPolicy interface { - Init(driver CPUDriver, topo CPUTopo) - Add(c v1.Container, qos QoS) error - Remove(c v1.Container, qos QoS) error +type State interface { + GetCPUSet(containerID string) (cpuset.CPUSet, bool) + GetDefaultCPUSet() cpuset.CPUSet + GetCPUSetOrDefault(containerID string) cpuset.CPUSet + SetCPUSet(containerID string, cpuset CPUSet) + SetDefaultCPUSet(cpuset CPUSet) + Delete(containerID string) } -type CPUDriver { - GetPods() []v1.Pod - GetCPUs(containerID string) CPUList - SetCPUs(containerID string, clist CPUList) error - // Future: RDT L3 and L2 cache masks, etc. +type Manager interface { + Start() + Policy() Policy + RegisterContainer(p *Pod, c *Container, containerID string) error + UnregisterContainer(containerID string) error + State() state.Reader } -type CPUTopo TBD - -type CPUList string - -func (c CPUList) Size() int {} +type Policy interface { + Name() string + Start(s state.State) + RegisterContainer(s State, pod *Pod, container *Container, containerID string) error + UnregisterContainer(s State, containerID string) error +} -// Returns a CPU list with size n and the remainder or -// an error if the request cannot be satisfied, taking -// into account the supplied topology. -// -// @post: c = set_union(taken, remaining), -// empty_set = set_intersection(taken, remainder) -func (c CPUList) Take(n int, topo CPUTopo) (taken CPUList, - remainder CPUList, - err error) {} +type CPUSet map[int]struct{} // set operations and parsing/formatting helpers -// Returns a CPU list that includes all CPUs in c and d and no others. -// -// @post: result = set_union(c, d) -func (c CPUList) Add(d CPUList) (result CPUList) {} +type CPUTopology TBD ``` Kubernetes will ship with three CPU manager policies. Only one policy is @@ -154,69 +150,36 @@ becomes terminal.) ##### Implementation sketch ```go -// Implements CPUManagerPolicy -type staticManager struct { - driver CPUDriver - topo CPUTopo - // CPU list assigned to non-exclusive containers. 
- shared CPUList -} - -func (m *staticManager) Init(driver CPUDriver, topo CPUTopo) { - m.driver = driver - m.topo = topo +func (p *staticPolicy) Start(s State) { + // Iteration starts at index `1` here because CPU `0` is reserved + // for infrastructure processes. + // TODO(CD): Improve this to align with kube/system reserved resources. + shared := NewCPUSet() + for cpuid := 1; cpuid < p.topology.NumCPUs; cpuid++ { + shared.Add(cpuid) + } + s.SetDefaultCPUSet(shared) } -func (m *staticManager) Add(c v1.Container, qos QoS) error { - if p.QoS == GUARANTEED && numExclusive(c) > 0 { - excl, err := allocate(numExclusive(c)) +func (p *staticPolicy) RegisterContainer(s State, pod *Pod, container *Container, containerID string) error { + if numCPUs := numGuaranteedCPUs(pod, container); numCPUs != 0 { + // container should get some exclusively allocated CPUs + cpuset, err := p.allocateCPUs(s, numCPUs) if err != nil { return err } - m.driver.SetCPUs(c.ID, excl) - return nil + s.SetCPUSet(containerID, cpuset) } - - // Default case: assign the shared set. - m.driver.SetCPUs(c.ID, m.shared) + // container belongs in the shared pool (nothing to do; use default cpuset) return nil } -func (m *staticManager) Remove(c v1.Container, qos QoS) error { - m.free(m.driver.GetCPUs(c.ID)) -} - -func (m *staticManager) allocate(n int) (CPUList, err) { - excl, remaining, err := m.shared.Take(n, m.topo) - if err != nil { - return "", err - } - m.setShared(remaining) - return excl, nil -} - -func (m *staticManager) free(c CPUList) { - m.setShared(m.shared.add(c)) -} - -func (m *staticManager) setShared(c CPUList) { - prev := m.shared - m.shared = c - for _, pod := range m.driver.GetPods() { - for _, container := range p.Containers { - if driver.GetCPUs(container.ID) == prev { - driver.SetCPUs(m.shared) - } - } - } -} - -// @pre: container_qos = guaranteed -func numExclusive(c v1.Container) int { - if c.resources.requests["cpu"] % 1000 == 0 { - return c.resources.requests["cpu"] / 1000 +func (p *staticPolicy) UnregisterContainer(s State, containerID string) error { + if toRelease, ok := s.GetCPUSet(containerID); ok { + s.Delete(containerID) + p.releaseCPUs(s, toRelease) } - return 0 + return nil } ``` @@ -238,19 +201,16 @@ _TODO: Describe the policy._ ##### Implementation sketch ```go -// Implements CPUManagerPolicy. -type dynamicManager struct {} - -func (m *dynamicManager) Init(driver CPUDriver, topo CPUTopo) { - // TODO +func (p *dynamicPolicy) Start(s State) { + // TODO } -func (m *dynamicManager) Add(c v1.Container, qos QoS) error { - // TODO +func (p *dynamicPolicy) RegisterContainer(s State, pod *Pod, container *Container, containerID string) error { + // TODO } -func (m *dynamicManager) Remove(c v1.Container, qos QoS) error { - // TODO +func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { + // TODO } ``` From 79a2bb5c0d0d36feb28f07844a745a29f3aba5a0 Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Wed, 12 Jul 2017 14:18:39 -0700 Subject: [PATCH 05/14] Minor formatting fix. 
--- contributors/design-proposals/cpu-manager.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index 3b2f53fd319..77ce302111c 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -304,5 +304,4 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { [hwloc]: https://www.open-mpi.org/projects/hwloc [procfs]: http://man7.org/linux/man-pages/man5/proc.5.html [qos]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md -[topo]: -http://github.com/intelsdi-x/swan/tree/master/pkg/isolation/topo +[topo]: http://github.com/intelsdi-x/swan/tree/master/pkg/isolation/topo From ca32930c93c483c2dbe75e970524220f81b6ef8c Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Thu, 13 Jul 2017 21:35:08 -0700 Subject: [PATCH 06/14] Fixed typos, described sched_relax_domain_level. --- contributors/design-proposals/cpu-manager.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index 77ce302111c..0e5721996de 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -238,13 +238,13 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { case the strategy outlined above where allocations are taken directly from the shared pool is too simplistic. We could allow an explicit pool of cores that may be exclusively allocated and default - this to the shared pool (leaving at least one core fro the shared + this to the shared pool (leaving at least one core for the shared cpuset to be used for OS, infra and non-exclusive containers. ## Practical challenges 1. Synchronizing CPU Manager state with the container runtime via the - CRI. Runc/libcontainer allows container cgroup settings to be updtaed + CRI. Runc/libcontainer allows container cgroup settings to be updated after creation, but neither the Kubelet docker shim nor the CRI implement a similar interface. 1. Mitigation: [PR 46105](https://github.com/kubernetes/kubernetes/pull/46105) @@ -287,7 +287,9 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { ## Appendix A: cpuset pitfalls -1. `cpuset.sched_relax_domain_level` +1. [`cpuset.sched_relax_domain_level`][cpuset-files]. "controls the width of + the range of CPUs over which the kernel scheduler performs immediate + rebalancing of runnable tasks across CPUs." 1. Child cpusets must be subsets of their parents. If B is a child of A, then B must be a subset of A. Attempting to shrink A such that B would contain allowed CPUs not in A is not allowed (the write will @@ -300,6 +302,7 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { 1. Tricky semantics when cpusets are combined with CFS shares and quota. 
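As one illustration of those tricky semantics: a container whose CFS quota
allows two CPUs' worth of runtime per period but whose cpuset contains a
single CPU can never consume its quota, and CPU shares only arbitrate among
tasks runnable on the same CPUs, so moving a container between cpusets
silently changes its effective weight relative to its neighbors.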
[cat]: http://www.intel.com/content/www/us/en/communications/cache-monitoring-cache-allocation-technologies.html +[cpuset-files]: http://man7.org/linux/man-pages/man7/cpuset.7.html#FILES [ht]: http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html [hwloc]: https://www.open-mpi.org/projects/hwloc [procfs]: http://man7.org/linux/man-pages/man5/proc.5.html From 8714cae91b25f709ebd0f453ad3c86c51711b8a7 Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Thu, 20 Jul 2017 17:08:26 -0700 Subject: [PATCH 07/14] Added CPU Manager block diagram. - Removed Policy() method from cpumanager.Manager interface. - Updated initial component description. --- contributors/design-proposals/cpu-manager.md | 22 ++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index 0e5721996de..c886880c963 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -55,11 +55,22 @@ assigning pod containers to sets of CPUs on the local node. In later phases, the scope will expand to include caches, a critical shared processor resource. -The CPU manager interacts directly with the kuberuntime. The CPU Manager -is notified when containers come and go, before delegating container -creation via the container runtime interface and after the container's -destruction respectively. The CPU Manager emits CPU settings for -containers in response. +The kuberuntime notifies the CPU manager when containers come and +go. The first such notification occurs in between the container runtime +interface calls to create and start the container. The second notification +occurs after the container is destroyed by the container runtime. The CPU +Manager writes CPU settings for containers using a new CRI method named +[`UpdateContainerResources`](https://github.com/kubernetes/kubernetes/pull/46105). +This new method is invoked from two places in the CPU manager: during each +call to `RegisterContainer` and also periodically from a separate +reconciliation loop. + +![cpu-manager-block-diagram](https://user-images.githubusercontent.com/379372/28443427-bf1b2972-6d6a-11e7-8acb-6cbe9013ac28.png) + +_CPU Manager block diagram. `Policy`, `State`, and `Topology` types are +factored out of the CPU Manager to promote reuse and to make it easier +to build and test new policies. The shared state abstraction forms a basis +for observability and checkpointing extensions._ #### Discovering CPU topology @@ -104,7 +115,6 @@ type State interface { type Manager interface { Start() - Policy() Policy RegisterContainer(p *Pod, c *Container, containerID string) error UnregisterContainer(containerID string) error State() state.Reader From 6eece46e694e70eb455c8ceee37b0f623fed1946 Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Thu, 20 Jul 2017 17:12:14 -0700 Subject: [PATCH 08/14] Updated topo discovery section with decision. --- contributors/design-proposals/cpu-manager.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index c886880c963..f60364e6d18 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -88,7 +88,10 @@ hyper-threading turned on yields both sibling threads on the same physical core. 
Likewise, allocating two CPUs on a non-hyper-threaded system yields two cores on the same socket. -##### Options for discovering topology +**Decision:** Initially the CPU Manager will re-use the existing discovery +mechanism in cAdvisor. + +Alternate options considered for discovering topology: 1. Read and parse the virtual file [`/proc/cpuinfo`][procfs] and construct a convenient data structure. @@ -98,8 +101,6 @@ system yields two cores on the same socket. contains code to build a ThreadSet from the output of `lscpu -p`. 1. Execute a mature external topology program like [`mpi-hwloc`][hwloc] -- potentially adding support for the hwloc file format to the Kubelet. -1. Re-use existing discovery functionality from cAdvisor. **(preferred initial - solution)** #### CPU Manager interfaces (sketch) From 63d8db159cc173cf58e3a5cd92d34e48a4026e9a Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Sun, 23 Jul 2017 22:39:10 -0700 Subject: [PATCH 09/14] Tied up loose ends in proposal. - Added explanations of configuration values. - Described how the static policy should be configured for compatibility with the node allocatable settings. - Cleaned up the observability section. - Expanded blurb about checkpointing in the block diagram description. - Added sections about what happens when: - Exclusive container is admitted. - Exclusive container terminates. - Shared pool becomes empty. - Shared pool becomes nonempty. --- contributors/design-proposals/cpu-manager.md | 79 ++++++++++++++++---- 1 file changed, 65 insertions(+), 14 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index f60364e6d18..70cd97edd92 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -69,8 +69,9 @@ reconciliation loop. _CPU Manager block diagram. `Policy`, `State`, and `Topology` types are factored out of the CPU Manager to promote reuse and to make it easier -to build and test new policies. The shared state abstraction forms a basis -for observability and checkpointing extensions._ +to build and test new policies. The shared state abstraction allows +other Kubelet components to be agnostic of the CPU manager policy for +observability and checkpointing extensions._ #### Discovering CPU topology @@ -136,6 +137,10 @@ type CPUTopology TBD Kubernetes will ship with three CPU manager policies. Only one policy is active at a time on a given node, chosen by the operator via Kubelet configuration. The three policies are **no-op**, **static** and **dynamic**. + +Operators can set the active CPU manager policy through a new Kubelet +configuration setting `--cpu-manager-policy`. + Each policy is described below. #### Policy 1: "no-op" cpuset control [default] @@ -158,6 +163,26 @@ node. Once allocated at pod admission time, an exclusive CPU remains assigned to a single container for the lifetime of the pod (until it becomes terminal.) +##### Configuration + +Operators can set the number of CPUs that pods may run on through a new +Kubelet configuration setting `--cpu-manager-static-num-cpus`, which +defaults to the number of logical CPUs available on the system. +The CPU manager takes this many CPUs as initial members of the shared +pool and allocates exclusive CPUs out of it. The initial membership grows +from the highest-numbered physical core down, topologically, leaving a gap +at the "bottom end" (physical core 0.) 
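A simplified sketch of how this initial shared pool could be derived from the
proposed `--cpu-manager-static-num-cpus` value is shown below; the function is
illustrative and works on logical CPU IDs, whereas a real implementation would
take whole physical cores at a time:

```go
// initialSharedPool selects `allowed` logical CPUs for the shared pool,
// walking from the highest-numbered CPU downward so that the low-numbered
// cores remain free for kube and system slices.
func initialSharedPool(numCPUs, allowed int) []int {
	pool := make([]int, 0, allowed)
	for id := numCPUs - 1; id >= 0 && len(pool) < allowed; id-- {
		pool = append(pool, id)
	}
	return pool
}
```

For example, `initialSharedPool(8, 6)` yields CPUs 7 through 2, leaving CPUs 0
and 1 at the "bottom end" for operating system and infrastructure processes.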
+ +Operator documentation will be updated to explain how to configure the +system to use the low-numbered physical cores for kube and system slices. + +_NOTE: Although config does exist to reserve resources for the Kubelet +and the system, it is best not to overload those values with additional +semantics. For more information see the [node allocatable proposal +document](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md). +Achieving compatible settings requires following a simple rule: +`num system CPUs = kubereserved.cpus + systemreserved.cpus + static.cpus`_ + ##### Implementation sketch ```go @@ -205,6 +230,38 @@ func (p *staticPolicy) UnregisterContainer(s State, containerID string) error { | Pod [Burstable] | All containers are assigned to the shared cpuset. | | Pod [BestEffort] | All containers are assigned to the shared cpuset. | +##### Example scenarios and interactions + +1. _A container arrives that requires exclusive cores._ + 1. Kuberuntime calls the CRI delegate to create the container. + 1. Kuberuntime registers the container with the CPU manager. + 1. CPU manager registers the container to the static policy. + 1. Static policy acquires CPUs from the default pool, by + topological-best-fit. + 1. Static policy updates the state, adding an assignment for the new + container and removing those CPUs from the default pool. + 1. CPU manager reads container assignment from the state. + 1. CPU manager updates the container resources via the CRI. + 1. Kuberuntime calls the CRI delegate to start the container. + +1. _A container that was assigned exclusive cores terminates._ + 1. Kuberuntime unregisters the container with the CPU manager. + 1. CPU manager unregisters the contaner with the static policy. + 1. Static policy adds the container's assigned CPUs back to the default + pool. + 1. Kuberuntime calls the CRI delegate to remove the container. + 1. Asynchronously, the CPU manager's reconcile loop updates the + cpuset for all containers running in the shared pool. + +1. _The shared pool becomes empty._ + 1. The CPU manager adds a taint with effect NoSchedule, NoExecute + that prevents BestEffort and Burstable QoS class pods from + running on the node. + +1. _The shared pool becomes nonempty._ + 1. The CPU manager removes the taint with effect NoSchedule, NoExecute + for BestEffort and Burstable QoS class pods. + #### Policy 3: "dynamic" cpuset control _TODO: Describe the policy._ @@ -239,18 +296,6 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { Kubelet restarts for any reason. * Read effective CPU assinments at runtime for alerting. This could be satisfied by the checkpointing requirement. -* Configuration - * How does the CPU Manager coexist with existing kube-reserved - settings? - * How does the CPU Manager coexist with related Linux kernel - configuration (e.g. `isolcpus`.) The operator may want to specify a - low-water-mark for the size of the shared cpuset. The operator may - want to correlate exclusive cores with the isolated CPUs, in which - case the strategy outlined above where allocations are taken - directly from the shared pool is too simplistic. We could allow an - explicit pool of cores that may be exclusively allocated and default - this to the shared pool (leaving at least one core for the shared - cpuset to be used for OS, infra and non-exclusive containers. 
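One way the checkpointing requirement above could be satisfied is a small
state file managed alongside the shared state abstraction; the file layout,
names and location below are illustrative assumptions rather than a decided
format:

```go
package checkpoint

import (
	"encoding/json"
	"os"
)

// Checkpoint is an illustrative on-disk snapshot of CPU manager state:
// the shared (default) cpuset plus per-container exclusive assignments.
type Checkpoint struct {
	DefaultCPUSet string            `json:"defaultCpuSet"` // e.g. "1-7"
	Entries       map[string]string `json:"entries"`       // container ID -> cpuset, e.g. "2,3"
}

// Save persists the snapshot so the CPU manager can pick up where it
// left off after a Kubelet restart.
func Save(path string, c Checkpoint) error {
	data, err := json.Marshal(c)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0600)
}

// Load restores a previously saved snapshot.
func Load(path string) (Checkpoint, error) {
	var c Checkpoint
	data, err := os.ReadFile(path)
	if err != nil {
		return c, err
	}
	return c, json.Unmarshal(data, &c)
}
```

The same structure could also back the runtime read path for alerting, since
it already reflects the effective assignments.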
## Practical challenges @@ -259,6 +304,12 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { after creation, but neither the Kubelet docker shim nor the CRI implement a similar interface. 1. Mitigation: [PR 46105](https://github.com/kubernetes/kubernetes/pull/46105) +1. Compatibility with the `isolcpus` Linux kernel boot parameter. The operator + may want to correlate exclusive cores with the isolated CPUs, in which + case the static policy outlined above, where allocations are taken + directly from the shared pool, is too simplistic. + 1. Mitigation: defer supporting this until a new policy tailored for + use with `isolcpus` can be added. ## Implementation roadmap From 694d2f4d8dd174481af3df5644ed08475a3a9b2e Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Mon, 24 Jul 2017 08:58:03 -0700 Subject: [PATCH 10/14] Updated CPU manager configuration details. - Use existing node allocatable settings instead of a new one. --- contributors/design-proposals/cpu-manager.md | 37 ++++++++------------ 1 file changed, 15 insertions(+), 22 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index 70cd97edd92..5f77cc0b92f 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -134,12 +134,24 @@ type CPUSet map[int]struct{} // set operations and parsing/formatting helpers type CPUTopology TBD ``` +#### Configuring the CPU Manager + Kubernetes will ship with three CPU manager policies. Only one policy is active at a time on a given node, chosen by the operator via Kubelet configuration. The three policies are **no-op**, **static** and **dynamic**. -Operators can set the active CPU manager policy through a new Kubelet -configuration setting `--cpu-manager-policy`. +The active CPU manager policy is set through a new Kubelet +configuration value `--cpu-manager-policy`. + +The number of CPUs that pods may run on is set using the existing +node-allocatable configuration settings. See the [node allocatable proposal +document][node-allocatable] for details. The CPU manager will claim +`floor(node.status.allocatable.cpu)` as the number of CPUs available to assign +to pods, starting from the highest-numbered physical core and descending +topologically. + +Operator documentation will be updated to explain how to configure the +system to use the low-numbered physical cores for kube and system slices. Each policy is described below. @@ -163,26 +175,6 @@ node. Once allocated at pod admission time, an exclusive CPU remains assigned to a single container for the lifetime of the pod (until it becomes terminal.) -##### Configuration - -Operators can set the number of CPUs that pods may run on through a new -Kubelet configuration setting `--cpu-manager-static-num-cpus`, which -defaults to the number of logical CPUs available on the system. -The CPU manager takes this many CPUs as initial members of the shared -pool and allocates exclusive CPUs out of it. The initial membership grows -from the highest-numbered physical core down, topologically, leaving a gap -at the "bottom end" (physical core 0.) - -Operator documentation will be updated to explain how to configure the -system to use the low-numbered physical cores for kube and system slices. - -_NOTE: Although config does exist to reserve resources for the Kubelet -and the system, it is best not to overload those values with additional -semantics. 
For more information see the [node allocatable proposal -document](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md). -Achieving compatible settings requires following a simple rule: -`num system CPUs = kubereserved.cpus + systemreserved.cpus + static.cpus`_ - ##### Implementation sketch ```go @@ -367,6 +359,7 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { [cpuset-files]: http://man7.org/linux/man-pages/man7/cpuset.7.html#FILES [ht]: http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html [hwloc]: https://www.open-mpi.org/projects/hwloc +[node-allocatable]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md [procfs]: http://man7.org/linux/man-pages/man5/proc.5.html [qos]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md [topo]: http://github.com/intelsdi-x/swan/tree/master/pkg/isolation/topo From 8c9863645ce5e5efbfc52e9dd22dd520484ba28f Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Mon, 24 Jul 2017 10:54:15 -0700 Subject: [PATCH 11/14] Fixed review comments from @balajismaniam. --- contributors/design-proposals/cpu-manager.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index 5f77cc0b92f..036e754718a 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -138,10 +138,10 @@ type CPUTopology TBD Kubernetes will ship with three CPU manager policies. Only one policy is active at a time on a given node, chosen by the operator via Kubelet -configuration. The three policies are **no-op**, **static** and **dynamic**. +configuration. The three policies are **noop**, **static** and **dynamic**. The active CPU manager policy is set through a new Kubelet -configuration value `--cpu-manager-policy`. +configuration value `--cpu-manager-policy`. The default value is `noop`. The number of CPUs that pods may run on is set using the existing node-allocatable configuration settings. See the [node allocatable proposal @@ -151,7 +151,8 @@ to pods, starting from the highest-numbered physical core and descending topologically. Operator documentation will be updated to explain how to configure the -system to use the low-numbered physical cores for kube and system slices. +system to use the low-numbered physical cores for kube-reserved and +system-reserved slices. Each policy is described below. 
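As a concrete example of this sizing rule (the node size and reserved values
are arbitrary): a 16-CPU node started with `--kube-reserved=cpu=500m` and
`--system-reserved=cpu=500m` reports `node.status.allocatable.cpu` of 15, so
the CPU manager claims 15 CPUs for the shared pool and exclusive allocations,
and the remaining low-numbered core is left to the kube-reserved and
system-reserved cgroups.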
@@ -359,7 +360,7 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { [cpuset-files]: http://man7.org/linux/man-pages/man7/cpuset.7.html#FILES [ht]: http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html [hwloc]: https://www.open-mpi.org/projects/hwloc -[node-allocatable]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md +[node-allocatable]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md#phase-2---enforce-allocatable-on-pods [procfs]: http://man7.org/linux/man-pages/man5/proc.5.html [qos]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md [topo]: http://github.com/intelsdi-x/swan/tree/master/pkg/isolation/topo From 3dfe261ede6c51543bfe5567c6d254186b3b2cd3 Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Mon, 24 Jul 2017 11:44:08 -0700 Subject: [PATCH 12/14] s/floor/ceiling, recommend integer allocatable.cpu --- contributors/design-proposals/cpu-manager.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index 036e754718a..e56074d8d2b 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -146,9 +146,10 @@ configuration value `--cpu-manager-policy`. The default value is `noop`. The number of CPUs that pods may run on is set using the existing node-allocatable configuration settings. See the [node allocatable proposal document][node-allocatable] for details. The CPU manager will claim -`floor(node.status.allocatable.cpu)` as the number of CPUs available to assign -to pods, starting from the highest-numbered physical core and descending -topologically. +`ceiling(node.status.allocatable.cpu)` as the number of CPUs available to +assign to pods, starting from the highest-numbered physical core and +descending topologically. It is recommended to configure an integer value for +`node.status.allocatable.cpus` when the CPU manager is enabled. Operator documentation will be updated to explain how to configure the system to use the low-numbered physical cores for kube-reserved and From 5780b4865d7e416513ca9e54ac1fd54608789258 Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Mon, 24 Jul 2017 12:39:28 -0700 Subject: [PATCH 13/14] Updated staticPolicy.Start sketch. - Observe node.status.allocatable.cpu --- contributors/design-proposals/cpu-manager.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index e56074d8d2b..3a99de443d1 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -149,7 +149,7 @@ document][node-allocatable] for details. The CPU manager will claim `ceiling(node.status.allocatable.cpu)` as the number of CPUs available to assign to pods, starting from the highest-numbered physical core and descending topologically. It is recommended to configure an integer value for -`node.status.allocatable.cpus` when the CPU manager is enabled. +`node.status.allocatable.cpu` when the CPU manager is enabled. Operator documentation will be updated to explain how to configure the system to use the low-numbered physical cores for kube-reserved and @@ -181,14 +181,13 @@ becomes terminal.) 
```go func (p *staticPolicy) Start(s State) { - // Iteration starts at index `1` here because CPU `0` is reserved - // for infrastructure processes. - // TODO(CD): Improve this to align with kube/system reserved resources. - shared := NewCPUSet() - for cpuid := 1; cpuid < p.topology.NumCPUs; cpuid++ { - shared.Add(cpuid) - } - s.SetDefaultCPUSet(shared) + fullCpuset := cpuset.NewCPUSet() + for cpuid := 0; cpuid < p.topology.NumCPUs; cpuid++ { + fullCpuset.Add(cpuid) + } + // Figure out which cores shall not be used in shared pool + reserved, _ := takeByTopology(p.topology, fullCpuset, p.topology.NumReservedCores) + s.SetDefaultCPUSet(fullCpuset.Difference(reserved)) } func (p *staticPolicy) RegisterContainer(s State, pod *Pod, container *Container, containerID string) error { From 6bc03acabb2706b0314ba08afe43dcdf632523cc Mon Sep 17 00:00:00 2001 From: Connor Doyle Date: Tue, 25 Jul 2017 10:41:54 -0700 Subject: [PATCH 14/14] Fix review comments from @derekwaynecarr. --- contributors/design-proposals/cpu-manager.md | 74 +++++++++++++------- 1 file changed, 50 insertions(+), 24 deletions(-) diff --git a/contributors/design-proposals/cpu-manager.md b/contributors/design-proposals/cpu-manager.md index 3a99de443d1..80da7424920 100644 --- a/contributors/design-proposals/cpu-manager.md +++ b/contributors/design-proposals/cpu-manager.md @@ -138,29 +138,31 @@ type CPUTopology TBD Kubernetes will ship with three CPU manager policies. Only one policy is active at a time on a given node, chosen by the operator via Kubelet -configuration. The three policies are **noop**, **static** and **dynamic**. +configuration. The three policies are **none**, **static** and **dynamic**. The active CPU manager policy is set through a new Kubelet -configuration value `--cpu-manager-policy`. The default value is `noop`. +configuration value `--cpu-manager-policy`. The default value is `none`. -The number of CPUs that pods may run on is set using the existing -node-allocatable configuration settings. See the [node allocatable proposal -document][node-allocatable] for details. The CPU manager will claim +The number of CPUs that pods may run on can be implicitly controlled using the +existing node-allocatable configuration settings. See the [node allocatable +proposal document][node-allocatable] for details. The CPU manager will claim `ceiling(node.status.allocatable.cpu)` as the number of CPUs available to assign to pods, starting from the highest-numbered physical core and -descending topologically. It is recommended to configure an integer value for -`node.status.allocatable.cpu` when the CPU manager is enabled. +descending topologically. It is recommended to configure `kube-reserved` +and `system-reserved` such that their sum is an integer when the CPU manager +is enabled. This ensures that `node.status.allocatable.cpu` is also an +integer. Operator documentation will be updated to explain how to configure the system to use the low-numbered physical cores for kube-reserved and -system-reserved slices. +system-reserved cgroups. Each policy is described below. -#### Policy 1: "no-op" cpuset control [default] +#### Policy 1: "none" cpuset control [default] This policy preserves the existing Kubelet behavior of doing nothing -with the cgroup `cpuset.cpus` and `cpuset.mems` controls. This “no-op” +with the cgroup `cpuset.cpus` and `cpuset.mems` controls. This "none" policy would become the default CPU Manager policy until the effects of the other policies are better understood. 
@@ -169,7 +171,7 @@ the other policies are better understood. The "static" policy allocates exclusive CPUs for containers if they are included in a pod of "Guaranteed" [QoS class][qos] and the container's resource limit for the CPU resource is an integer greater than or -equal to one. +equal to one. All other containers share a set of CPUs. When exclusive CPUs are allocated for a container, those CPUs are removed from the allowed CPUs of every other container running on the @@ -177,6 +179,20 @@ node. Once allocated at pod admission time, an exclusive CPU remains assigned to a single container for the lifetime of the pod (until it becomes terminal.) +Workloads that need to know their own CPU mask, e.g. for managing +thread-level affinity, can read it from the virtual file `/proc/self/status`: + +``` +$ grep -i cpus /proc/self/status +Cpus_allowed: 77 +Cpus_allowed_list: 0-2,4-6 +``` + +Note that containers running in the shared cpuset should not attempt any +application-level CPU affinity of their own, as those settings may be +overwritten without notice (whenever exclusive cores are +allocated or deallocated.) + ##### Implementation sketch ```go @@ -239,7 +255,7 @@ func (p *staticPolicy) UnregisterContainer(s State, containerID string) error { 1. _A container that was assigned exclusive cores terminates._ 1. Kuberuntime unregisters the container with the CPU manager. - 1. CPU manager unregisters the contaner with the static policy. + 1. CPU manager unregisters the container with the static policy. 1. Static policy adds the container's assigned CPUs back to the default pool. 1. Kuberuntime calls the CRI delegate to remove the container. @@ -247,18 +263,28 @@ func (p *staticPolicy) UnregisterContainer(s State, containerID string) error { cpuset for all containers running in the shared pool. 1. _The shared pool becomes empty._ - 1. The CPU manager adds a taint with effect NoSchedule, NoExecute - that prevents BestEffort and Burstable QoS class pods from - running on the node. + 1. The CPU manager adds a node condition with effect NoSchedule, + NoExecute that prevents BestEffort and Burstable QoS class pods from + running on the node. BestEffort and Burstable QoS class pods are + evicted from the node. 1. _The shared pool becomes nonempty._ - 1. The CPU manager removes the taint with effect NoSchedule, NoExecute - for BestEffort and Burstable QoS class pods. + 1. The CPU manager removes the node condition with effect NoSchedule, + NoExecute for BestEffort and Burstable QoS class pods. #### Policy 3: "dynamic" cpuset control _TODO: Describe the policy._ +Capturing discussions from resource management meetings and proposal comments: + +Unlike the static policy, when the dynamic policy allocates exclusive CPUs to +a container, the cpuset may change during the container's lifetime. If deemed +necessary, we discussed providing a signal in the following way. We could +project (a subset of) the CPU manager state into a volume visible to selected +containers. User workloads could subscribe to update events in a normal Linux +manner (e.g. inotify.) + ##### Implementation sketch ```go @@ -287,7 +313,7 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { * Checkpointing assignments * The CPU Manager must be able to pick up where it left off in case the Kubelet restarts for any reason. -* Read effective CPU assinments at runtime for alerting. This could be +* Read effective CPU assignments at runtime for alerting. This could be satisfied by the checkpointing requirement. 
## Practical challenges @@ -306,23 +332,23 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { ## Implementation roadmap -### Phase 1: No-op policy +### Phase 1: None policy [TARGET: Kubernetes v1.8] * Internal API exists to allocate CPUs to containers ([PR 46105](https://github.com/kubernetes/kubernetes/pull/46105)) -* Kubelet configuration includes a CPU manager policy (initially only no-op) -* No-op policy is implemented. +* Kubelet configuration includes a CPU manager policy (initially only none) +* None policy is implemented. * All existing unit and e2e tests pass. * Initial unit tests pass. -### Phase 2: Static policy +### Phase 2: Static policy [TARGET: Kubernetes v1.8] * Kubelet can discover "basic" CPU topology (HT-to-physical-core map) * Static policy is implemented. * Unit tests for static policy pass. * e2e tests for static policy pass. * Performance metrics for one or more plausible synthetic workloads show - benefit over no-op policy. + benefit over none policy. ### Phase 3: Cache allocation @@ -334,7 +360,7 @@ func (p *dynamicPolicy) UnregisterContainer(s State, containerID string) error { * Unit tests for dynamic policy pass. * e2e tests for dynamic policy pass. * Performance metrics for one or more plausible synthetic workloads show - benefit over no-op policy. + benefit over none policy. ### Phase 5: NUMA