Add cpuset_mems to docker driver. #16069
Conversation
Hi @shishir-a412ed! As you might have seen, we shipped NUMA support in Nomad Enterprise 1.7.0. I'm going to close this PR, but thanks for kicking off our discussions around this!
@tgross I was just looking for this feature in the Docker driver and was even going to add it if it wasn't there. I did similar pull requests with Docker itself over 10 years ago [1] and discussions in Kubernetes over 8 years ago [2]. Control over this has been in Docker for a long time and is commonly used across many bare-metal deployments.

Nomad Enterprise 1.7.0 shipped with support for NUMA in the scheduler. Congratulations, it is not an easy problem -- such an advanced feature is worthy of an Enterprise tier.

The Docker driver config and the NUMA-aware scheduler operate at different levels. The NUMA-aware scheduler works at a much higher level: it understands existing workloads and resources, and dynamically and optimally places jobs. The Docker driver configurations are only manageable (but very useful!) at small scale. The NUMA-aware scheduler is not Enterprise-tier because it can pin jobs to a NUMA node -- it is Enterprise-tier because it can pin jobs to the most optimal NUMA node given all the resources and constraints of the cluster.

I personally combine these settings with node-specifying constraints. Lastly, it may actually impair performance to ignore these settings. The alternative I'm about to try is running ...

[1] moby/moby#439
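For context, a rough sketch of what the scheduler-level approach looks like in Nomad Enterprise 1.7+ (a `numa` block under `resources`; the job name, image, and values here are illustrative, and the exact schema is in the Enterprise docs), as opposed to the per-task Docker driver settings discussed above:

```hcl
job "feed-handler" {
  group "handler" {
    task "engine" {
      driver = "docker"

      config {
        image = "example/feed-handler:latest"  # hypothetical image
      }

      resources {
        cores  = 4
        memory = 8192

        # Nomad Enterprise 1.7+ scheduler-level NUMA awareness: the scheduler
        # chooses the client and the NUMA node, then pins cores and memory to it.
        numa {
          affinity = "require"
        }
      }
    }
  }
}
```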
I channeled my frustration into OSS love / empowerment and started this project right after that comment -- something I've been wanting to make for a few years now. It actually wasn't that hard, given the great Skeleton tooling. One can use Onload without the NICs for userspace ...

I don't personally have the workloads or scale for it -- but for sure there's something amazing to behold with kernel-bypassed Nomad jobs running on a pool of GPUs and NICs smoothly connected over a fabric of shared memory and hardware acceleration, all optimally orchestrated for mechanical sympathy by a NUMA-aware scheduler that maintains a dynamic model of the entire cluster's CPUs, GPUs, NICs, memory, and job affinities. There's an AI dream just sitting here unassembled. Please pull on that thread for Enterprise! Don't limit Nomad by closing this issue without merging. 🥺
@neomantra from an architectural perspective, Nomad tries to avoid having resource constraints in the task driver that can conflict with each other after placement without a corresponding scheduler component. The reason is that if the scheduler isn't aware of those resources, it can make multiple placements to the wrong node in a short period of time and cause deployment failures. This definitely doesn't cover 100% of features of all task drivers (because of other unfortunate architectural decisions, like not having the task driver config schema be part of the fingerprint), but it's a general goal. If you've got a specific proposal that you think you could make a case for given that context, we'd be happy to discuss it in a new issue.
@tgross Thanks for the response. I appreciate the context there. I realized my proposal is: ...

If used, an operator really needs to use them all together (or at least understand what they do). They can really mess things up by using them at all. If configuring these, they should also consider the ...

But, as I wrote about my specific use case of high-performance isolated workloads, I began to realize that the Docker ...
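As a rough illustration of using these settings together with node targeting (the node class value and image are hypothetical, and `cpuset_mems` is only the option proposed in this PR, not a merged feature):

```hcl
job "hft" {
  group "hft" {
    # Hypothetical constraint: steer this group onto nodes the operator has
    # set aside (node class is operator-defined), so manual cpuset pinning
    # does not collide with other workloads.
    constraint {
      attribute = "${node.class}"
      value     = "numa-pinned"
    }

    task "engine" {
      driver = "docker"

      config {
        image       = "example/engine:latest"  # hypothetical image
        cpuset_cpus = "4-7"   # existing Docker driver option
        cpuset_mems = "1"     # option proposed in this PR
      }
    }
  }
}
```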
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Currently, we have NUMA support for CPU pinning in the Nomad Docker driver via the cpuset_cpus option. However, we don't have support for pinning the workload to a particular set of memory nodes.

Without both (cpuset_cpus and cpuset_mems) the feature is not complete: one can pin the workload to a particular NUMA node, but its memory can still spread across NUMA nodes. This PR adds support for pinning the workload to a particular set of memory nodes.

Added a unit test and updated the Nomad Docker driver docs on the website.
Signed-off-by: Shishir Mahajan smahajan@roblox.com
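For illustration, a minimal sketch of a job using the existing option together with the one proposed here (the job name, image, and values are placeholders; `cpuset_mems` is only available with this PR's changes):

```hcl
job "pinned-cache" {
  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:7"

        # Existing option: pin the task's threads to CPUs on NUMA node 0.
        cpuset_cpus = "0-3"

        # Proposed by this PR: pin the task's memory allocations to NUMA node 0,
        # so its pages do not spread to other memory nodes.
        cpuset_mems = "0"
      }
    }
  }
}
```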