Implicit tolerations #5282

Open
@johnbelamaric

Enhancement Description

Administrators often taint nodes that have high-value resources such as GPUs, so that workloads that do not need those resources do not consume them. To simplify the user experience, some platforms (e.g., GKE) run a webhook that automatically adds tolerations for those taints when a pod has extended resource requests for those resources. This ensures that pods still run even if the user forgets to add the toleration, but only for those pods that actually need it.

With the advent of DRA, the exact needs of the workload are no longer determinable simply by looking at the PodSpec during API admission. Instead, the resource claims and device classes must also be examined. Additionally, the optionality available in DRA resource claim APIs may mean that several different types of nodes/resources (and therefore several different types of tolerations) are needed. A webhook does not have access to all the information it would need to add the tolerations at API admission time.

We discussed adding a "high value resource" aspect to node capabilities, but after further discussion it's not clear that's the right way to solve this problem. This enhancement request provides an alternative approach.

In this approach, we create a new scheduler plugin (or update the existing taints and tolerations plugin) that can be configured to examine the PodSpec and all associated ResourceClaims and DeviceClasses at scheduling time and, based on the needs of the workload, implicitly tolerate taints. Essentially, we move the behavior of the webhook from API server admission time to Pod scheduling time, when all the necessary information is available.
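To make the scheduling-time check concrete, here is a minimal sketch in Go. It is not the proposed implementation: `Taint`, `Toleration`, `ResourceClaim`, and `Pod` are simplified stand-ins for the real API types, the `byClass` mapping and the `example.com` names are hypothetical, and the sketch treats every taint as blocking regardless of effect.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the core/v1 types.
type Taint struct {
	Key    string
	Effect string
}

type Toleration struct {
	Key    string
	Effect string // empty matches any effect
}

// ResourceClaim is a stand-in that only records which DeviceClass
// names the claim's requests reference.
type ResourceClaim struct {
	DeviceClasses []string
}

// Pod is a stand-in holding only the fields this sketch needs.
type Pod struct {
	Tolerations []Toleration
	ClaimNames  []string
}

// implicitTolerations collects the tolerations implied by the device
// classes that the pod's claims reference. byClass is a hypothetical
// mapping from DeviceClass name to tolerations.
func implicitTolerations(pod Pod, claims map[string]ResourceClaim, byClass map[string][]Toleration) []Toleration {
	var out []Toleration
	for _, name := range pod.ClaimNames {
		claim, ok := claims[name]
		if !ok {
			continue // claim not found: nothing is implied
		}
		for _, class := range claim.DeviceClasses {
			out = append(out, byClass[class]...)
		}
	}
	return out
}

// tolerates reports whether any toleration matches the taint.
func tolerates(tols []Toleration, taint Taint) bool {
	for _, t := range tols {
		if t.Key == taint.Key && (t.Effect == "" || t.Effect == taint.Effect) {
			return true
		}
	}
	return false
}

// fits reports whether every node taint is tolerated, either
// explicitly in the pod or implicitly through its claims.
func fits(pod Pod, claims map[string]ResourceClaim, byClass map[string][]Toleration, nodeTaints []Taint) bool {
	all := append([]Toleration{}, pod.Tolerations...)
	all = append(all, implicitTolerations(pod, claims, byClass)...)
	for _, taint := range nodeTaints {
		if !tolerates(all, taint) {
			return false
		}
	}
	return true
}

func main() {
	claims := map[string]ResourceClaim{
		"gpu-claim": {DeviceClasses: []string{"gpu.example.com"}},
	}
	byClass := map[string][]Toleration{
		"gpu.example.com": {{Key: "example.com/gpu", Effect: "NoSchedule"}},
	}
	nodeTaints := []Taint{{Key: "example.com/gpu", Effect: "NoSchedule"}}

	gpuPod := Pod{ClaimNames: []string{"gpu-claim"}}
	plainPod := Pod{}

	fmt.Println(fits(gpuPod, claims, byClass, nodeTaints))   // true
	fmt.Println(fits(plainPod, claims, byClass, nodeTaints)) // false
}
```

In this sketch the pod object is never mutated: the implied tolerations exist only during the fit check, which is the main difference from the webhook approach.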

The specific way to calculate the tolerations, and the taints they will tolerate, will likely need to be part of the scheduler plugin's configuration, since upstream does not know what those taints are or when and how they should be tolerated.
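One possible shape for such configuration is sketched below in Go. Everything here is an assumption made for illustration: the `ImplicitTolerationArgs` and `ImplicitTolerationRule` types, their JSON field names, and the `example.com` values are hypothetical and not part of any existing scheduler API. Each rule says "when the workload requests this device class or extended resource, tolerate this taint".

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ImplicitTolerationRule is a hypothetical rule: a trigger (device
// class or extended resource) and the taint to tolerate when it matches.
type ImplicitTolerationRule struct {
	DeviceClassName  string `json:"deviceClassName,omitempty"`
	ExtendedResource string `json:"extendedResource,omitempty"`
	TaintKey         string `json:"taintKey"`
	TaintEffect      string `json:"taintEffect,omitempty"`
}

// ImplicitTolerationArgs is a hypothetical plugin-args type.
type ImplicitTolerationArgs struct {
	Rules []ImplicitTolerationRule `json:"rules"`
}

// parseArgs decodes and validates the hypothetical plugin arguments.
func parseArgs(raw []byte) (ImplicitTolerationArgs, error) {
	var args ImplicitTolerationArgs
	if err := json.Unmarshal(raw, &args); err != nil {
		return args, fmt.Errorf("decoding plugin args: %w", err)
	}
	for i, r := range args.Rules {
		if r.TaintKey == "" {
			return args, fmt.Errorf("rules[%d]: taintKey is required", i)
		}
		if r.DeviceClassName == "" && r.ExtendedResource == "" {
			return args, fmt.Errorf("rules[%d]: deviceClassName or extendedResource is required", i)
		}
	}
	return args, nil
}

// contains reports whether s is in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// toleratedTaintKeys returns the taint keys implied by the device
// classes and extended resources that a workload requests.
func toleratedTaintKeys(args ImplicitTolerationArgs, deviceClasses, extendedResources []string) map[string]bool {
	keys := map[string]bool{}
	for _, r := range args.Rules {
		byClass := r.DeviceClassName != "" && contains(deviceClasses, r.DeviceClassName)
		byResource := r.ExtendedResource != "" && contains(extendedResources, r.ExtendedResource)
		if byClass || byResource {
			keys[r.TaintKey] = true
		}
	}
	return keys
}

func main() {
	raw := []byte(`{"rules":[
		{"deviceClassName":"gpu.example.com","taintKey":"example.com/gpu","taintEffect":"NoSchedule"},
		{"extendedResource":"example.com/gpu","taintKey":"example.com/gpu","taintEffect":"NoSchedule"}
	]}`)
	args, err := parseArgs(raw)
	if err != nil {
		panic(err)
	}
	keys := toleratedTaintKeys(args, []string{"gpu.example.com"}, nil)
	fmt.Println(len(keys), keys["example.com/gpu"])
}
```

Keeping the rules in cluster-level configuration lets each platform name its own taints and triggers without any upstream knowledge of them.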

This approach requires no new user-facing APIs. Pods that must run on tainted nodes but do not actually need the specialized device (such as management pods) can still be configured explicitly with the appropriate tolerations.

/cc @pohly @klueska @pravk03 @dom4ha @dchen1107
/sig scheduling
/wg device-management

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

Metadata

Labels

lead-opted-in: Denotes that an issue has been opted in to a release
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
stage/alpha: Denotes an issue tracking an enhancement targeted for Alpha status
wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status

📋 Backlog

Status

Needs Triage

Status

Removed from Milestone

Milestone

No milestone
