
Add cpuset_mems to docker driver. #16069

Closed
wants to merge 3 commits

Conversation

@shishir-a412ed (Contributor) commented Feb 6, 2023

Currently, the Nomad docker driver supports NUMA CPU pinning via the cpuset_cpus option. However, there is no support for pinning a workload to a particular set of memory nodes.

Without both cpuset_cpus and cpuset_mems the feature is incomplete: one can pin a workload's CPUs to a particular NUMA node, but its memory can still be spread across NUMA nodes.

This PR adds support for pinning the workload to a particular set of memory nodes.

Added a unit test and updated the Nomad docker driver docs on the website.
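For illustration, a minimal sketch of how the proposed option would sit next to the existing cpuset_cpus in a task config (the image and cpuset values are placeholders; which CPUs belong to which NUMA node depends on the host):

```hcl
task "cache" {
  driver = "docker"

  config {
    image = "redis:7"

    # Existing option: pin the container's processes to CPUs 0-3
    # (assumed here to be on NUMA node 0).
    cpuset_cpus = "0-3"

    # Proposed in this PR: restrict memory allocations to NUMA node 0,
    # passed through to docker as --cpuset-mems.
    cpuset_mems = "0"
  }
}
```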

cpuset_mems

Signed-off-by: Shishir Mahajan <smahajan@roblox.com>
@tgross (Member) commented Jan 17, 2024

Hi @shishir-a412ed! As you might have seen, we shipped NUMA support in Nomad Enterprise 1.7.0. I'm going to close this PR, but thanks for kicking off our discussions around this!

@tgross closed this Jan 17, 2024
@neomantra

@tgross I was just looking for this feature in the Docker Driver and was even going to add it if it wasn't there.

I submitted similar pull requests to Docker itself over 10 years ago [1] and joined discussions in Kubernetes over 8 years ago [2]. Control over this has been in Docker for a long time and is commonly used across bare-metal deployments.

Nomad Enterprise 1.7.0 shipped with support for NUMA in the Scheduler. Congratulations, it is not an easy problem -- such an advanced feature is worthy of an Enterprise tier.


The Docker Driver Config and the NUMA-Aware Scheduler operate at different levels.

Both cpuset_cpus and cpuset_mems are advanced docker run knobs for controlling process placement. When an operator uses them, they are NOT using the Nomad scheduler's placement at all; they are relying on their own explicit placement directives. Like cpuset_cpus, the cpuset_mems implementation is a simple pass-through of config to the driver; that is why this PR is short, simple, and consistent with the cpuset_cpus implementation.

The NUMA-Aware scheduler operates at a much higher level, understanding existing workloads and resources, dynamically and optimally placing jobs.

The Docker Driver configurations are only manageable (but very useful!) at small scale... the NUMA-Aware Scheduler is not Enterprise-tier because it can pin jobs to a NUMA node -- it is Enterprise-tier because it can pin jobs to the most optimal NUMA node given all the resources and constraints of the Cluster.


I personally combine these settings with node-specifying constraints, all data-driven by Terraform. But as noted, this approach has limited scalability and is not dynamic like a scheduler.

Lastly, ignoring cpuset_mems while also using cpuset_cpus may actually impair performance. These are potential footguns for the uninitiated. [3]
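A rough sketch of that footgun, assuming a two-node host where CPUs 0-3 sit on NUMA node 0 (the image name is hypothetical):

```hcl
config {
  image = "feed-handler:latest"   # hypothetical image

  # Pin the container's processes to CPUs on NUMA node 0.
  cpuset_cpus = "0-3"

  # Without cpuset_mems, allocations may still be satisfied from NUMA
  # node 1, so the pinned CPUs pay remote-memory latency on every miss.
  # Adding cpuset_mems = "0" keeps memory local to node 0.
}
```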

The alternative I'm about to try is running numactl in a sidecar, or maintaining our own Nomad build with this one patch.

[1] moby/moby#439
[2] kubernetes/kubernetes#10570 (comment)
[3] https://gist.github.com/neomantra/3c9b89887d19be6fa5708bf4017c0ecd#the-foot-gun (hope to update for Nomad, as this is what I'm doing right now 🤞 )

@neomantra commented Jan 21, 2024

I channeled my frustration into OSS love / empowerment and started this project right after that comment -- something I've been wanting to make for a few years now. It actually wasn't that hard, given the great Skeleton tooling.

One can use Onload without the NICs for userspace epoll, or with user-space TCP/UDP acceleration via XDP. So it is generally applicable to Nomad clusters.

I don't personally have the workloads or scale for it -- but for sure there's something amazing to behold in kernel-bypassed Nomad jobs running on a pool of GPUs and NICs smoothly connected over a fabric of shared memory and hardware acceleration, all optimally orchestrated for mechanical sympathy by a NUMA-aware scheduler that maintains a dynamic model of the entire cluster's CPUs, GPUs, NICs, memory, and job affinities. There's an AI dream just sitting here unassembled.

Please pull on that thread for Enterprise! Don't limit Nomad by closing this issue without merging. 🥺

@tgross (Member) commented Jan 22, 2024

@neomantra from an architectural perspective, Nomad tries to avoid having resource constraints in the task driver that can conflict with each other after placement without a corresponding scheduler component. The reason is that if the scheduler isn't aware of those resources, it can make multiple placements to the wrong node in a short period of time and cause deployment failures. This definitely doesn't cover 100% of the features of all task drivers (because of other unfortunate architectural decisions, like not having the task driver config schema be part of the fingerprint), but it's a general goal.

If you've got a specific proposal that you think you could make a case for given that context, we'd be happy to discuss it in a new issue.

@neomantra commented Jan 22, 2024

@tgross Thanks for the response. I appreciate the context there. I realized my proposal is:

  • Expose the cpuset_cpus, cpuset_mems, and cgroup-parent docker run configs
  • Document them clearly as advanced and dangerous

If used, an operator really needs to use them all together (or at least understand what each one does); using them at all can really mess things up. Anyone configuring these should also consider the cgroup_parent client config.

But, as I wrote about my specific use case of high-performance isolated workloads, I began to realize that the Docker cgroup_parent also needs to be exposed. Rather than beat a closed PR with it, I will make an issue detailing the reasoning.
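For concreteness, a hypothetical sketch of how the three options might sit together if this proposal were adopted (cgroup_parent is not an existing Nomad docker driver option; the image and values are placeholders):

```hcl
config {
  image = "feed-handler:latest"     # hypothetical image

  cpuset_cpus = "0-3"               # existing docker driver option
  cpuset_mems = "0"                 # proposed in this PR

  # Hypothetical option per the proposal above: place the container
  # under a dedicated cgroup, mirroring docker run --cgroup-parent.
  cgroup_parent = "/isolated"
}
```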


I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators Jan 22, 2025