
Add cpuset_mems to docker driver. #16069

Closed
wants to merge 3 commits

Conversation

@shishir-a412ed (Contributor) commented Feb 6, 2023

Currently, the Nomad docker driver supports NUMA CPU pinning via the cpuset_cpus option. However, there is no support for pinning a workload to a particular set of memory nodes.

Without both cpuset_cpus and cpuset_mems the feature is incomplete: one can pin a workload's CPUs to a particular NUMA node, but its memory can still be spread across NUMA nodes.

This PR adds support for pinning the workload to a particular set of memory nodes.

Added a unit test and updated the Nomad docker driver docs on the website.
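For illustration, a minimal sketch of how the proposed option would sit next to the existing cpuset_cpus in a task config (the image and cpuset values are placeholders; which CPUs belong to which NUMA node depends on the host):

```hcl
task "cache" {
  driver = "docker"

  config {
    image = "redis:7"

    # Existing option: pin the container's processes to CPUs 0-3
    # (assumed here to be on NUMA node 0).
    cpuset_cpus = "0-3"

    # Proposed in this PR: restrict memory allocations to NUMA node 0,
    # passed through to docker as --cpuset-mems.
    cpuset_mems = "0"
  }
}
```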

cpuset_mems

Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
@tgross (Member) commented Jan 17, 2024

Hi @shishir-a412ed! As you might have seen, we shipped NUMA support in Nomad Enterprise 1.7.0. I'm going to close this PR, but thanks for kicking off our discussions around this!

@tgross closed this Jan 17, 2024
@neomantra

@tgross I was just looking for this feature in the Docker Driver and was even going to add it if it wasn't there.

I submitted similar pull requests to Docker itself over 10 years ago [1] and joined discussions in Kubernetes over 8 years ago [2]. Control over this has been in Docker for a long time and is commonly used across bare-metal deployments.

Nomad Enterprise 1.7.0 shipped with support for NUMA in the Scheduler. Congratulations, it is not an easy problem -- such an advanced feature is worthy of an Enterprise tier.


The Docker Driver Config and the NUMA-Aware Scheduler operate at different levels.

Both cpuset_cpus and cpuset_mems are advanced docker run knobs for controlling process placement. When an operator uses them, they are NOT using the Nomad scheduler's placement at all; they are relying on their own explicit placement directives. Like cpuset_cpus, the cpuset_mems implementation is a simple pass-through of config to the driver; that is why this PR is short, simple, and consistent with the cpuset_cpus implementation.

The NUMA-Aware scheduler operates at a much higher level, understanding existing workloads and resources, dynamically and optimally placing jobs.

The Docker Driver configurations are only manageable (but very useful!) at small scale... the NUMA-Aware Scheduler is not Enterprise-tier because it can pin jobs to a NUMA node -- it is Enterprise-tier because it can pin jobs to the most optimal NUMA node given all the resources and constraints of the Cluster.


I personally combine these settings with node-specifying constraints, all data-driven by Terraform. But as noted, this approach has limited scalability and is not dynamic like a scheduler.

Lastly, ignoring cpuset_mems while also using cpuset_cpus may actually impair performance. These are potential footguns for the uninitiated. [3]
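A rough sketch of that footgun, assuming a two-node host where CPUs 0-3 sit on NUMA node 0 (the image name is hypothetical):

```hcl
config {
  image = "feed-handler:latest"   # hypothetical image

  # Pin the container's processes to CPUs on NUMA node 0.
  cpuset_cpus = "0-3"

  # Without cpuset_mems, allocations may still be satisfied from NUMA
  # node 1, so the pinned CPUs pay remote-memory latency on every miss.
  # Adding cpuset_mems = "0" keeps memory local to node 0.
}
```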

The alternative I'm about to try is running numactl in a sidecar, or maintaining our own Nomad build with this one patch.

[1] moby/moby#439
[2] kubernetes/kubernetes#10570 (comment)
[3] https://gist.github.com/neomantra/3c9b89887d19be6fa5708bf4017c0ecd#the-foot-gun (hope to update for Nomad, as this is what I'm doing right now 🤞 )

@neomantra commented Jan 21, 2024

I channeled my frustration into OSS love / empowerment and started this project right after that comment -- something I've been wanting to make for a few years now. It actually wasn't that hard, given the great Skeleton tooling.

One can use Onload without the NICs for userspace epoll, or with user-space TCP/UDP acceleration via XDP. So it is generally applicable to Nomad clusters.

I don't personally have the workloads or scale for it -- but for sure there's something amazing to behold in kernel-bypassed Nomad jobs running on a pool of GPUs and NICs smoothly connected over a fabric of shared memory and hardware acceleration, all optimally orchestrated for mechanical sympathy by a NUMA-aware scheduler that maintains a dynamic model of the entire cluster's CPUs, GPUs, NICs, memory, and job affinities. There's an AI dream just sitting here unassembled.

Please pull on that thread for Enterprise! Don't limit Nomad by closing this issue without merging. 🥺

@tgross (Member) commented Jan 22, 2024

@neomantra from an architectural perspective, Nomad tries to avoid having resource constraints in the task driver that can conflict with each other after placement without a corresponding scheduler component. The reason is that if the scheduler isn't aware of those resources, it can make multiple placements to the wrong node in a short period of time and cause deployment failures. This definitely doesn't cover 100% of the features of all task drivers (because of other unfortunate architectural decisions, like not having the task driver config schema be part of the fingerprint), but it's a general goal.

If you've got a specific proposal that you think you could make a case for given that context, we'd be happy to discuss it in a new issue.

@neomantra commented Jan 22, 2024

@tgross Thanks for the response. I appreciate the context there. I realized my proposal is:

  • Expose the cpuset_cpus, cpuset_mems, and cgroup-parent docker run configs
  • Document them clearly as advanced and dangerous

If used, an operator really needs to use them all together (or at least understand what each one does); using them at all can really mess things up. Anyone configuring these should also consider the cgroup_parent client config.

But, as I wrote about my specific use case of high-performance isolated workloads, I began to realize that the Docker cgroup_parent also needs to be exposed. Rather than beat a closed PR with it, I will make an issue detailing the reasoning.
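For concreteness, a hypothetical sketch of how the three options might sit together if this proposal were adopted (cgroup_parent is not an existing Nomad docker driver option; the image and values are placeholders):

```hcl
config {
  image = "feed-handler:latest"     # hypothetical image

  cpuset_cpus = "0-3"               # existing docker driver option
  cpuset_mems = "0"                 # proposed in this PR

  # Hypothetical option per the proposal above: place the container
  # under a dedicated cgroup, mirroring docker run --cgroup-parent.
  cgroup_parent = "/isolated"
}
```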


I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Jan 22, 2025