From 476bc3bf8c24cbb5a825f03b4361b7a8949253f7 Mon Sep 17 00:00:00 2001
From: Seth Jennings
Date: Thu, 15 Dec 2016 10:52:57 -0600
Subject: [PATCH 1/4] hugepage proposal

---
 contributors/design-proposals/hugepages.md | 213 +++++++++++++++++++++
 1 file changed, 213 insertions(+)
 create mode 100644 contributors/design-proposals/hugepages.md

diff --git a/contributors/design-proposals/hugepages.md b/contributors/design-proposals/hugepages.md
new file mode 100644
index 00000000000..83f692636f9
--- /dev/null
+++ b/contributors/design-proposals/hugepages.md
@@ -0,0 +1,213 @@
+## Abstract
+
+A proposal to allow huge page use by applications running in a Kubernetes
+cluster.
+
+A pod should be able to have a number of huge pages for use by the
+application. The scheduler should be able have visibility into the node
+capacity of huge pages, for each huge page size, and make a decision about whether
+the pod can be scheduled on that node. The kubelet should report the number of
+available huge pages and set up the environment such that the pod can
+successfully use the number of huge pages requested in the pod definition.
+
+## Motivation
+
+Huge page support is needed for many large memory HPC workloads to achieve
+acceptable performance levels.
+
+This proposal is part of a larger effort to better support High Performance
+Computing (HPC) workloads in Kubernetes.
+
+### Scope
+
+This proposal only includes pre-allocated huge pages configured on the node by
+the administrator at boot time or by manual dynamic allocation. It does not
+discuss the kubelet attempting to allocate huge pages dynamically to
+accommodate a scheduling pod, or the use of Transparent Huge Pages (THP). THP
+does not require knowledge by Kubernetes at all; it simply requires the node to
+have THP enabled and the application to `madvise()` with `MADV_HUGEPAGE` the
+memory regions it desires to be backed by huge pages. Note that THP might lead
+to performance degradation on nodes with high memory utilization or
+fragmentation due to the defragmenting efforts of THP, which can lock memory
+pages. For this reason, some applications may be designed to (or recommend) use
+pre-allocated huge pages instead of THP.
+
+The proposal is also limited to x86_64 support where two huge page sizes are
+supported: 2MB and 1GB. The design, however, should accommodate additional huge
+page sizes available on other architectures.
+
+**NOTE: This design, as currently proposed, requires the use of pod-level
+cgroups, which are currently not available but are under development by
+@derekwaynecarr**
+
+## Background
+
+Huge pages are a hardware feature designed to reduce pressure on the Translation
+Lookaside Buffer (TLB). The TLB is a small hardware cache of
+virtual-to-physical page mappings. If the virtual address passed in a hardware
+instruction can be found in the TLB, the mapping can be determined quickly. If
+not, a TLB miss occurs and the hardware must walk the in-memory page table to
+discover a physical mapping for the virtual address.
+
+Take a program that operates on a large 2MB structure as an example. If the
+program accesses that space in such a way that one byte in each regular 4kB page
+is accessed, 2MB/4kB = 512 TLB entries are needed to map the address range. Each
+TLB miss results in an expensive walk of the page table. However, if the
+allocation is backed by a 2MB huge page, only 1 TLB entry is required,
+resulting in a high likelihood that the entry will remain in the cache and hit
+on accesses to the entire 2MB structure.
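+
+For example, a program can request huge page backing for an allocation
+explicitly. A minimal sketch in Go (using `golang.org/x/sys/unix`; a C program
+would pass the same flags to `mmap()`):
+
+```
+package main
+
+import (
+	"fmt"
+
+	"golang.org/x/sys/unix"
+)
+
+func main() {
+	const size = 2 << 20 // one 2MB huge page
+	// MAP_HUGETLB requests backing from the pre-allocated huge page pool;
+	// the call fails with ENOMEM if no free huge pages are available.
+	buf, err := unix.Mmap(-1, 0, size,
+		unix.PROT_READ|unix.PROT_WRITE,
+		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS|unix.MAP_HUGETLB)
+	if err != nil {
+		fmt.Println("no huge pages available:", err)
+		return
+	}
+	defer unix.Munmap(buf)
+	buf[0] = 1 // first touch faults in the page; one TLB entry maps all 2MB
+	fmt.Printf("mapped %d bytes backed by a single huge page\n", len(buf))
+}
+```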
+
+On x86_64, there are two huge page sizes: 2MB and 1GB. 1GB huge pages are also
+called gigantic pages. 1GB pages must be enabled on the kernel boot line with
+`hugepagesz=1g`. Huge pages, especially 1GB ones, should be allocated early,
+before memory fragments (i.e. at/near boot time), to increase the likelihood
+that they can be allocated successfully with minimal memory migration (i.e.
+defrag) required.
+
+## Use Cases
+
+The class of applications that benefit from huge pages typically have:
+- A large memory working set
+- A sensitivity to memory access latency
+
+Example applications include:
+- Java applications can back the heap with huge pages using the `-XX:+UseLargePages` option.
+- In-memory databases
+
+Applications can generally use huge pages by calling:
+- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and use it as anonymous memory
+- `shmget()` with `SHM_HUGETLB` and use it as a shared memory segment (see Known Issues).
+
+### Pod Specification
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: example
+spec:
+  containers:
+...
+    resources:
+      requests:
+        hugepages: "10"
+      limits:
+        hugepages: "10"
+  nodeSelector:
+    alpha.kubernetes.io/huge-page-size: "2MB"
+```
+
+Huge pages cannot be overcommitted on a node.
+
+While a system may support multiple huge page sizes, it is assumed that nodes
+configured with huge pages will only use one huge page size, namely the default
+huge page size shown in `cat /proc/meminfo | grep Hugepagesize`. In Linux, this
+is 2MB unless overridden by `default_hugepagesz=1g` in the kernel boot
+parameters.
+
+The huge page size for the node will be reported by the kubelet as a label
+`alpha.kubernetes.io/huge-page-size` on the node resource. This is done
+because there are a variety of huge page sizes across different hardware
+architectures, and making a new resource field for each size doesn't scale. Pods
+can do a nodeSelector on this label to land on a system with a particular huge
+page size. This is similar to how the `beta.kubernetes.io/arch` label
+operates.
+
+## Limits and Quota
+
+LimitRange should be able to define minimum and maximum constraints for huge
+pages, and Quota should be able to count them.
+
+## Implementation
+
+### Phase 0: Design Agreement
+
+**Target 1.5**
+
+Get design approval.
+
+### Phase 1: Add huge page support
+
+**Target 1.5+**
+
+Implement huge page support with pod-level cgroups to enforce per-pod huge page
+limits (not yet available). Enforcing huge page limits with pod-level cgroups
+avoids, at least temporarily, the need for 1) `docker` to support the
+`hugetlb` cgroup controller directly and 2) adding huge pages to the
+Container Runtime Interface (CRI).
+
+pkg/api/types.go (and v1/types.go)
+
+```
+const (
+	// CPU, in cores. (500m = .5 cores)
+	ResourceCPU ResourceName = "cpu"
+	// Memory, in bytes. (500Gi = 500GiB = 500 * 1024 * 1024 * 1024)
+	ResourceMemory ResourceName = "memory"
+...
+	// Huge pages, as a count of pages of the node's huge page size.
+	ResourceHugePages ResourceName = "hugepages"
+)
+```
+
+The kubelet will report the total and available huge page statistics from
+cAdvisor in the node status as the node's huge page capacity and availability.
+
+Modifications are needed in `setNodeStatusMachineInfo` in
+`pkg/kubelet/kubelet_node_status.go` and in `CapacityFromMachineInfo` in
+`pkg/kubelet/cadvisor/util.go`.
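+
+For illustration, a sketch of the `CapacityFromMachineInfo` change, assuming
+cAdvisor exposes a huge page count on `MachineInfo` (the `HugePages` field
+proposed under Phase 2 below; exact names are illustrative):
+
+```
+func CapacityFromMachineInfo(info *cadvisorapi.MachineInfo) v1.ResourceList {
+	return v1.ResourceList{
+		v1.ResourceCPU: *resource.NewMilliQuantity(
+			int64(info.NumCores*1000), resource.DecimalSI),
+		v1.ResourceMemory: *resource.NewQuantity(
+			int64(info.MemoryCapacity), resource.BinarySI),
+		// Huge pages are accounted as a count of pages, not bytes; the
+		// page size is discoverable via the node label described below.
+		v1.ResourceHugePages: *resource.NewQuantity(
+			int64(info.HugePages), resource.DecimalSI),
+	}
+}
+```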
+
+The kubelet will also need to create the `alpha.kubernetes.io/huge-page-size`
+label for its node resource (if self-registering).
+
+pkg/api/unversioned/well_known_labels.go
+
+```
+const (
+...
+	LabelArch = "beta.kubernetes.io/arch"
+	LabelHugePageSize = "alpha.kubernetes.io/huge-page-size"
+)
+```
+
+The scheduler will need to ensure any huge page request defined in the pod spec can be fulfilled by a candidate node.
+
+cAdvisor will need to be modified to return the number of available huge pages.
+This is already supported in [runc/libcontainer](../../vendor/github.com/opencontainers/runc/libcontainer/cgroups/utils.go).
+
+### Phase 2: Expose huge pages in CRI
+
+*WIP*
+
+info/v1/machine.go
+
+```
+type MachineInfo struct {
+...
+	// HugePages is the number of pre-allocated huge pages on the node.
+	HugePages int `json:"huge_pages"`
+}
+```
+
+Add `hugetlb` cgroup controller support to docker (TODO: add docker/docker
+issue/PR) and expose it via the engine-api.
+
+engine-api/types/container/host_config.go
+
+```
+type Resources struct {
+...
+	Ulimits []*units.Ulimit // List of ulimits to be set in the container
+	HugePages int64 // Huge pages limit
+...
+}
+```
+
+## Known Issues
+
+### Huge pages as shared memory
+
+For the Java use case, the JVM maps the huge pages as a shared memory segment
+and memlocks them to prevent the system from moving or swapping them out.
+
+There are several issues here:
+- The user running the Java app must be a member of the gid set in the `vm.hugetlb_shm_group` sysctl
+- The sysctl `kernel.shmmax` must allow the size of the shared memory segment
+- The user's memlock ulimits must allow the size of the shared memory segment
+
+`vm.hugetlb_shm_group` is not namespaced.

From 8e926fc623d793bb221e49e49606b125d21515c1 Mon Sep 17 00:00:00 2001
From: PiotrProkop
Date: Mon, 26 Jun 2017 15:36:21 +0200
Subject: [PATCH 2/4] Adding huge page volume plugin

---
 contributors/design-proposals/hugepages.md | 87 ++++++++++++++++++++--
 1 file changed, 81 insertions(+), 6 deletions(-)

diff --git a/contributors/design-proposals/hugepages.md b/contributors/design-proposals/hugepages.md
index 83f692636f9..f4ff4b37365 100644
--- a/contributors/design-proposals/hugepages.md
+++ b/contributors/design-proposals/hugepages.md
@@ -4,7 +4,7 @@ A proposal to allow huge page use by applications running in a Kubernetes
 cluster.
 
 A pod should be able to have a number of huge pages for use by the
-application. The scheduler should be able have visibility into the node
+application. The scheduler should be able to have visibility into the node
 capacity of huge pages, for each huge page size, and make a decision about whether
 the pod can be scheduled on that node. The kubelet should report the number of
 available huge pages and set up the environment such that the pod can
@@ -12,8 +12,8 @@ successfully use the number of huge pages requested in the pod definition.
 
 ## Motivation
 
-Huge page support is needed for many large memory HPC workloads to achieve
-acceptable performance levels.
+Huge page support is needed for many large memory HPC workloads or DPDK-based
+NFV solutions to achieve acceptable performance levels.
 
 This proposal is part of a larger effort to better support High Performance
 Computing (HPC) workloads in Kubernetes.
@@ -30,7 +30,11 @@ memory regions it desires to be backed by huge pages. Note that THP might lead
 to performance degradation on nodes with high memory utilization or
 fragmentation due to the defragmenting efforts of THP, which can lock memory
 pages. For this reason, some applications may be designed to (or recommend) use
-pre-allocated huge pages instead of THP.
+pre-allocated huge pages instead of THP. 
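+
+For illustration, opting a memory range into THP is a single advisory call. A
+minimal sketch in Go (using `golang.org/x/sys/unix`; helper name is
+illustrative):
+
+```
+import "golang.org/x/sys/unix"
+
+// adviseTHP asks the kernel to back buf with transparent huge pages.
+// The hint is advisory: it needs no pre-allocated pool and no support
+// from Kubernetes, and the kernel is free to ignore it.
+func adviseTHP(buf []byte) error {
+	return unix.Madvise(buf, unix.MADV_HUGEPAGE)
+}
+```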
+
+DPDK-based applications request huge pages using the `mmap()` system call,
+which requires that a mount point of type `hugetlbfs` be present in the
+application's mount namespace.
 
 The proposal is also limited to x86_64 support where two huge page sizes are
 supported: 2MB and 1GB. The design, however, should accommodate additional huge
@@ -73,9 +77,11 @@ The class of applications that benefit from huge pages typically have:
 Example applications include:
 - Java applications can back the heap with huge pages using the `-XX:+UseLargePages` option.
 - In-memory databases
+- DPDK-based applications
 
 Applications can generally use huge pages by calling:
 - `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and use it as anonymous memory
+- `mmap()` a file backed by `hugetlbfs`
 - `shmget()` with `SHM_HUGETLB` and use it as a shared memory segment (see Known Issues).
 
 ### Pod Specification
@@ -112,6 +118,41 @@ can do a nodeSelector on this label to land on a system with a particular huge
 page size. This is similar to how the `beta.kubernetes.io/arch` label
 operates.
 
+### Huge page volume plugin
+
+```
+apiVersion: v1
+kind: Pod
+metadata:
+  name: example
+spec:
+  containers:
+...
+    volumeMounts:
+    - mountPath: /hugepages
+      name: hugepage
+  volumes:
+  - name: hugepage
+    hugePages:
+      pageSize: "2M"
+      size: "200M"
+      minSize: "2M"
+```
+
+The user can specify where to mount the `hugetlbfs` filesystem inside a given
+container in the pod. The volume options correspond to the `hugetlbfs` mount
+options:
+- pageSize - if the platform supports multiple huge page sizes, the pagesize
+option can be used to specify the huge page size and associated pool. If
+pagesize is not specified, the platform's default huge page size and associated
+pool will be used.
+- size - sets the maximum amount of memory (huge pages) allowed for the
+filesystem. The max size is rounded down to a HPAGE_SIZE boundary.
+- minSize - the min_size option sets the minimum amount of memory (huge pages)
+allowed for the filesystem. At mount time, the number of huge pages specified
+by min_size are reserved for use by the filesystem.
+
 ## Limits and Quota
 
 LimitRange should be able to define minimum and maximum constraints for huge
@@ -125,7 +166,41 @@ pages, and Quota should be able to count them.
 
 Get design approval.
 
-### Phase 1: Add huge page support
+### Phase 1: Huge page volume plugin
+
+**Target 1.8+**
+
+Implement the huge page volume plugin. Accounting and limiting huge pages
+before `Phase 2` can be done via Opaque Integer Resources (OIR). A sketch of
+the mount the plugin performs follows the API definition below.
+
+pkg/api/types.go (and v1/types.go)
+
+```
+type VolumeSource struct {
+...
+	// HugePages represents a huge page resource.
+	// +optional
+	HugePages *HugePagesVolumeSource `json:"hugePages,omitempty" protobuf:"bytes,28,opt,name=hugePages"`
+}
+
+// HugePagesVolumeSource represents Linux HugeTlbPage https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
+type HugePagesVolumeSource struct {
+	// Defaults to 2M
+	// +optional
+	PageSize string `json:"pageSize,omitempty" protobuf:"bytes,1,opt,name=pageSize"`
+	// The MaxSize option sets the maximum value of memory (huge pages).
+	// The MaxSize option is specified as resource.Quantity
+	MaxSize string `json:"size,omitempty" protobuf:"bytes,2,opt,name=size"`
+	// The MinSize option sets the minimum value of memory (huge pages)
+	// allowed for the filesystem and reserves them at mount time.
+	// The MinSize option is specified as resource.Quantity
+	// +optional
+	MinSize string `json:"minSize,omitempty" protobuf:"bytes,3,opt,name=minSize"`
+}
+```
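+
+For illustration, the mount performed for the example pod above is equivalent
+to `mount -t hugetlbfs -o pagesize=2M,size=200M,min_size=2M nodev /hugepages`.
+A minimal kubelet-side sketch in Go (helper name and error handling are
+illustrative):
+
+```
+import (
+	"fmt"
+	"os"
+
+	"golang.org/x/sys/unix"
+)
+
+// mountHugePages mounts a hugetlbfs instance at target using the
+// pageSize/size/minSize options from the volume source.
+func mountHugePages(target, pageSize, size, minSize string) error {
+	if err := os.MkdirAll(target, 0755); err != nil {
+		return err
+	}
+	data := fmt.Sprintf("pagesize=%s,size=%s,min_size=%s",
+		pageSize, size, minSize)
+	return unix.Mount("nodev", target, "hugetlbfs", 0, data)
+}
+```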
+
+### Phase 2: Add huge page support
 
 **Target 1.5+**
 
@@ -172,7 +247,7 @@ The scheduler will need to ensure any huge page request defined in the pod spec
 
 cAdvisor will need to be modified to return the number of available huge pages.
 This is already supported in [runc/libcontainer](../../vendor/github.com/opencontainers/runc/libcontainer/cgroups/utils.go).
 
-### Phase 2: Expose huge pages in CRI
+### Phase 3: Expose huge pages in CRI
 
 *WIP*

From 4c045e3350bf2b45500a79ad5d03915216c1cbb8 Mon Sep 17 00:00:00 2001
From: PiotrProkop
Date: Tue, 4 Jul 2017 15:36:02 +0200
Subject: [PATCH 3/4] PageSize defaults to Hugepagesize rather than 2M

---
 contributors/design-proposals/hugepages.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contributors/design-proposals/hugepages.md b/contributors/design-proposals/hugepages.md
index f4ff4b37365..b88fcc7f85f 100644
--- a/contributors/design-proposals/hugepages.md
+++ b/contributors/design-proposals/hugepages.md
@@ -185,7 +185,7 @@ type VolumeSource struct {
 
 // HugePagesVolumeSource represents Linux HugeTlbPage https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
 type HugePagesVolumeSource struct {
-	// Defaults to 2M
+	// Defaults to the platform's default huge page size
 	// +optional
 	PageSize string `json:"pageSize,omitempty" protobuf:"bytes,1,opt,name=pageSize"`
 	// The MaxSize option sets the maximum value of memory (huge pages).

From 551db8fbb75db9af8e22fa236dbb0dced671d7da Mon Sep 17 00:00:00 2001
From: PiotrProkop
Date: Tue, 4 Jul 2017 17:16:35 +0200
Subject: [PATCH 4/4] Changes due to @cmluciano review

---
 contributors/design-proposals/hugepages.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contributors/design-proposals/hugepages.md b/contributors/design-proposals/hugepages.md
index b88fcc7f85f..b54de34f33d 100644
--- a/contributors/design-proposals/hugepages.md
+++ b/contributors/design-proposals/hugepages.md
@@ -34,7 +34,8 @@ pre-allocated huge pages instead of THP. 
 
 DPDK-based applications request huge pages using the `mmap()` system call,
 which requires that a mount point of type `hugetlbfs` be present in the
-application's mount namespace.
+application's mount namespace. The proposed design includes a huge page
+volume plugin that ensures this mount point exists.
 
 The proposal is also limited to x86_64 support where two huge page sizes are
 supported: 2MB and 1GB. The design, however, should accommodate additional huge