Commit 76092c3

author
Renaud Gaubert
committed
Added Device plugin proposal
1 parent c426590 commit 76092c3

2 files changed

+197
-0
lines changed

@@ -0,0 +1,180 @@
# Device Manager Proposal

* [Abstract](#abstract)
* [Motivation](#motivation)
* [Use Cases](#use-cases)
* [Objectives](#objectives)
* [Proposed Changes](#proposed-changes)


_Authors:_

* @RenaudWasTaken - Renaud Gaubert <rgaubert@nvidia.com>

## Abstract

This document describes a solution for discovering, monitoring and representing
external devices such as:
* GPUs
* NICs
* FPGAs
* InfiniBand
* ...

## Motivation

Kubernetes currently supports discovery primarily of CPU and memory, and only to a
minimal extent. Very few devices are handled natively by the kubelet.

Expecting every vendor to add vendor-specific code inside Kubernetes is not a
sustainable solution. This approach does not scale and is not portable.

We want a solution that lets vendors advertise their resources to the kubelet
and monitor them.
We also want a way for users to specify which resources their jobs will use and what
constraints are associated with these resources.

To solve this problem, we need a plugin system through which vendors advertise
and monitor their resources on behalf of the kubelet.

Additionally, we introduce the concept of ResourceType so that resources can be
selected with constraints in a pod spec.

_GPU Integration Example:_
* [Enable "kick the tires" support for Nvidia GPUs in COS](https://github.com/kubernetes/kubernetes/pull/45136)
* [Extend experimental support to multiple Nvidia GPUs](https://github.com/kubernetes/kubernetes/pull/42116)

_Kubernetes Meeting Notes On This:_
* [Meeting notes](https://docs.google.com/document/d/1Qg42Nmv-QwL4RxicsU2qtZgFKOzANf8fGayw8p3lX6U/edit#)
* [Better Abstraction for Compute Resources in Kubernetes](https://docs.google.com/document/d/1666PPUs4Lz56TqKygcy6mXkNazde-vwA7q4e5H92sUc)
* [Extensible support for hardware devices in Kubernetes (join kubernetes-dev@googlegroups.com for access)](https://docs.google.com/document/d/1LHeTPx_fWA1PdZkHuALPzYxR0AYXUiiXdo3S0g2VSlo/edit)

## Use Cases

* I want to use a particular device type (GPU, InfiniBand, FPGA, etc.) in my pod.
* I should be able to use that device without writing custom Kubernetes code.
* I want a consistent and portable solution for consuming hardware devices across k8s clusters.

## Objectives

1. Create a plugin mechanism which allows discovery and monitoring of devices
2. Add support for ResourceType in the scheduler and kubelet

## Proposed Changes

### API Changes
#### ResourceType

When discovering the devices, the kubelet will be in charge of advertising those
resources to the API server.

We will advertise each device returned by the Device Plugin in a new structure
called the ResourceType.
It is defined as follows:

```golang
// ResourceType describes a single device returned by a Device Plugin.
type ResourceType struct {
	Kind       string
	Name       string // the only field needed when calling Allocate or Deallocate
	Quantity   resource.Quantity
	Properties map[string]string
}
```
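
To illustrate how `Properties` could support selecting resources with constraints, here is a minimal sketch. It is not part of the proposal: the `Matches` method, the kind name, the device name and the property keys are all hypothetical, and `Quantity` is simplified to an `int64` so that the example needs no Kubernetes import.

```go
package main

import "fmt"

// ResourceType mirrors the proposed struct. Quantity is simplified to an
// int64 here (the proposal uses resource.Quantity) to stay dependency-free.
type ResourceType struct {
	Kind       string
	Name       string
	Quantity   int64
	Properties map[string]string
}

// Matches is a hypothetical helper: it reports whether r has the given kind
// and carries every key/value pair listed in constraints.
func (r ResourceType) Matches(kind string, constraints map[string]string) bool {
	if r.Kind != kind {
		return false
	}
	for k, want := range constraints {
		if got, ok := r.Properties[k]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	gpu := ResourceType{
		Kind:       "nvidia-gpu", // hypothetical kind name
		Name:       "GPU-0",      // hypothetical device name
		Quantity:   1,
		Properties: map[string]string{"memory": "16G", "ecc": "true"}, // hypothetical keys
	}
	fmt.Println(gpu.Matches("nvidia-gpu", map[string]string{"memory": "16G"})) // true
	fmt.Println(gpu.Matches("nvidia-gpu", map[string]string{"memory": "32G"})) // false
}
```

Exact string equality is the simplest possible constraint; the proposal does not say whether richer comparisons are intended.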

Because the current API (`Capacity`) cannot be extended to support ResourceType,
we will need to create two new attributes in the NodeStatus structure:
* `CapacityV2`: Describing the capacity of the node
* `Available`: Describing the available resources

```golang
type NodeStatus struct {
	Capacity   ResourceList
	CapacityV2 []ResourceType // capacity of the node

	Available []ResourceType // resources currently available
}
```

We also introduce the `Allocated` field in the pod's status so that users
can see which devices were assigned to the pod.

```golang
type PodStatus struct {
	Allocated []ResourceType // devices assigned to the pod
}
```

### Device Plugin

We expect device plugins to be deployed through a DaemonSet. The plugins will have to register
themselves with the kubelet when they start running.
The kubelet will then start interacting with the plugin through the `Discover`, `Monitor`, `Allocate`
and `Deallocate` functions.

The device plugin will have to mount `/var/lib/kubelet/plugins/device-plugin/kubelet.sock`, which
is the socket that the plugin can use to register itself.

Registration is a simple process in which the device plugin communicates its IP address (obtained through
the Downward API) and the port on which its gRPC server is listening.
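
As a rough sketch of the registration payload, the plugin could assemble its dial address as below. The proposal does not define the format of `DialAddress`, so the `host:port` string, the `dialAddress` helper, the `POD_IP` variable name and the port `50051` are all assumptions made for illustration. The actual gRPC call is omitted.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
)

// kubeletSocket is the registration socket path given in the proposal.
const kubeletSocket = "/var/lib/kubelet/plugins/device-plugin/kubelet.sock"

// dialAddress is a hypothetical helper that validates the plugin's IP and
// port and joins them into a "host:port" string (IPv6 hosts get brackets).
func dialAddress(podIP string, port int) (string, error) {
	if net.ParseIP(podIP) == nil {
		return "", fmt.Errorf("invalid pod IP %q", podIP)
	}
	if port < 1 || port > 65535 {
		return "", fmt.Errorf("invalid port %d", port)
	}
	return net.JoinHostPort(podIP, strconv.Itoa(port)), nil
}

func main() {
	// POD_IP is an assumed variable name, expected to be injected through
	// the Downward API (fieldPath status.podIP) in the DaemonSet spec.
	addr, err := dialAddress(os.Getenv("POD_IP"), 50051)
	if err != nil {
		fmt.Println("cannot register:", err)
		return
	}
	fmt.Printf("would send %s to the kubelet over %s\n", addr, kubeletSocket)
}
```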

When receiving a pod which requests devices (GPUs, for example), the kubelet will be in charge of:
* deciding which device to assign to the pod's containers
* advertising the changes to the node's `Available` list
* advertising the changes to the pod's `Allocated` list
* calling the `Allocate` function with the list of devices

The scheduler will still be in charge of filtering out the nodes which cannot satisfy the
resource requests.
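
The scheduler-side filtering could look roughly like the sketch below. It assumes the scheduler only counts devices of the requested kind in each node's `Available` list. The `Node` type and the `filterNodes` function are hypothetical and are not taken from the scheduler's code.

```go
package main

import "fmt"

// ResourceType is trimmed to the two fields this sketch needs.
type ResourceType struct {
	Kind string
	Name string
}

// Node is a hypothetical, trimmed-down view of a node: its name and the
// devices currently listed in NodeStatus.Available.
type Node struct {
	Name      string
	Available []ResourceType
}

// filterNodes keeps the names of the nodes that still have at least count
// available devices of the requested kind.
func filterNodes(nodes []Node, kind string, count int) []string {
	var fit []string
	for _, n := range nodes {
		free := 0
		for _, d := range n.Available {
			if d.Kind == kind {
				free++
			}
		}
		if free >= count {
			fit = append(fit, n.Name)
		}
	}
	return fit
}

func main() {
	nodes := []Node{
		{Name: "node-a", Available: []ResourceType{{"gpu", "GPU-0"}, {"gpu", "GPU-1"}}},
		{Name: "node-b", Available: []ResourceType{{"gpu", "GPU-0"}}},
	}
	fmt.Println(filterNodes(nodes, "gpu", 2)) // [node-a]
}
```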

The typical process follows this pattern:
1. A user submits a pod spec requesting X devices
2. The scheduler filters out the nodes which do not match the resource requests
3. The pod lands on the node

4. Kubelet decides which device should be assigned to the pod
5. Kubelet removes the devices from the `Available` pool
6. Kubelet updates the pod's status with the allocated devices
7. Kubelet calls `Allocate` on the matching Device Plugins

8. The user deletes the pod or the pod terminates
9. Kubelet calls `Deallocate` on the matching Device Plugins
10. Kubelet puts the devices back in the `Available` pool

![Process](./device-plugin.svg)
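
The kubelet-side bookkeeping in steps 4 to 6 and step 10 can be sketched as a small in-memory pool. This is only an illustration under simplifying assumptions: devices are identified by name alone, the first free devices are picked, and the `devicePool` type with its methods is hypothetical. The calls to the plugin (steps 7 and 9) are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// devicePool is a hypothetical stand-in for the kubelet's bookkeeping:
// available holds free device names, allocated maps a pod to its devices.
type devicePool struct {
	available []string
	allocated map[string][]string
}

func newDevicePool(devices ...string) *devicePool {
	return &devicePool{available: devices, allocated: map[string][]string{}}
}

// assign covers steps 4 to 6: pick n devices, remove them from the
// available pool and record them against the pod.
func (p *devicePool) assign(pod string, n int) ([]string, error) {
	if n < 0 || n > len(p.available) {
		return nil, errors.New("not enough available devices")
	}
	picked := append([]string(nil), p.available[:n]...) // copy, do not alias the pool
	p.available = p.available[n:]
	p.allocated[pod] = picked
	return picked, nil // step 7 would pass these devices to Allocate
}

// release covers step 10: return the pod's devices to the available pool.
func (p *devicePool) release(pod string) {
	p.available = append(p.available, p.allocated[pod]...)
	delete(p.allocated, pod)
}

func main() {
	pool := newDevicePool("GPU-0", "GPU-1", "GPU-2")
	devs, err := pool.assign("pod-a", 2)
	fmt.Println(devs, err, pool.available)
	pool.release("pod-a") // after step 9, Deallocate on the plugin
	fmt.Println(pool.available)
}
```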

The kubelet will also be able to call `Allocate` and `Deallocate` on any device returned
by the `Discover` function.

Calling `Allocate` will return a CRI spec which allows the plugin to set cgroups, specify
environment variables, ...

Calling `Allocate` on a device that is already allocated and has not been deallocated should
return an error, as should calling `Deallocate` on a device that is not allocated or has
already been deallocated.

When calling `Allocate` or `Deallocate`, only the `Name` field needs to be set.

```protobuf
service PluginRegistration {
  rpc Register(DialAddress) returns (Error);
}

// NOTE: a gRPC method takes exactly one request message, so the empty and
// two-argument signatures below would need wrapper messages in a real .proto file.
service DeviceManager {
  rpc Discover() returns (stream Device);
  rpc Monitor() returns (stream DeviceHealth);

  rpc Allocate(PodSandboxConfig, stream Device) returns (PodSandboxConfig);
  rpc Deallocate(stream ResourceType) returns (Error);
}

message Device {
  // Protobuf field numbers must start at 1; 0 is not a valid field number.
  string Kind = 1;
  string Name = 2;
  string Quantity = 3;
  map<string, string> properties = 4; // Could be [1, 1.2, 1G]
}

message DeviceHealth {
  string Name = 1;
  string Status = 2;
}
```
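
To show how the streaming `Discover` call might be consumed, the sketch below replaces the gRPC stream with a Go channel. It is a simplified in-process analogy, not generated gRPC code: the `DeviceManager` interface, the `staticPlugin` type and the `collect` function are hypothetical.

```go
package main

import "fmt"

// Device mirrors the Device message above.
type Device struct {
	Kind       string
	Name       string
	Quantity   string
	Properties map[string]string
}

// DeviceManager is a hypothetical in-process counterpart of the service;
// a receive-only channel stands in for the gRPC stream.
type DeviceManager interface {
	Discover() <-chan Device
}

// staticPlugin is a fake plugin that advertises a fixed device list.
type staticPlugin struct{ devices []Device }

func (p staticPlugin) Discover() <-chan Device {
	out := make(chan Device)
	go func() {
		defer close(out) // closing the channel ends the stream
		for _, d := range p.devices {
			out <- d
		}
	}()
	return out
}

// collect drains the stream and returns the advertised device names.
func collect(m DeviceManager) []string {
	var names []string
	for d := range m.Discover() {
		names = append(names, d.Name)
	}
	return names
}

func main() {
	plugin := staticPlugin{devices: []Device{
		{Kind: "gpu", Name: "GPU-0", Quantity: "1"},
		{Kind: "gpu", Name: "GPU-1", Quantity: "1"},
	}}
	fmt.Println(collect(plugin))
}
```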