Determine if we should support cpuset-cpus and cpuset-mem #10570
Short answer: not soon. Long answer: supporting both cpusets and CFS well is very complex. Having to support both options will increase the development time of some features which we definitely want, such as:
The complexity will become further evident if we later add features like:
If we expose the docker flags to users via the Pod spec today, it will be much harder to build these things. However, we should begin collecting the use cases where users think cpusets will help them. Then we can have a broader cost/benefit discussion about supporting cpusets. I think I know what many of these use cases are, but it would still be good to hear users state them.
@erictune - I am going to direct the original questioner to this issue for feedback on their needs, but my initial response, prior to opening the issue, was in line with yours.
@erictune Is it possible for you to explain the main difficulty if we support cpuset? My use case: since k8s will support a job controller soon, we want to pin long-running containers to specific CPUs and let other jobs (batches) compete for the remaining cores. Does that make sense?
Hi Eric, judging by your initial comments, I'm sure we're on the same page in understanding the technical, performance benefits of this feature. Indeed, the use cases for exposing these knobs are precisely as you've stated. If a user has an existing performance-sensitive application that lends itself well to microservices/containerization, they'll need all of these features in order to consider containers as a possibility. These applications typically have their own init scripts which apply tuning at both the host and app level, and (very often) automatically account for different hardware/virtual topologies. We could approach containerizing this sort of application by simply migrating the init scripts (hardware discovery, numactl, SCHED_FIFO and IRQ tuning) into each container. That, of course, is the definition of an anti-pattern. However, I'd argue that wanting to run performance-sensitive pods (I call them "PSPs") in Kube is not an anti-pattern -- it's a pattern that's not implemented yet. Agreed that usage may require some sort of nodeSelector matching, to avoid the complexities of having CFS workloads co-located with these HPC loads. Exposing these knobs is one step in enabling a Kube stack to support workloads such as batch processing, HPC/scientific computing, and compute-bound big-data analysis. Examples of additional enablement would be exposing specific I/O devices to pods, RDMA, and having a container "follow" a PCI device in terms of NUMA locality. Thanks!
There are benefits to exposing this level of detail (cores, NUMA nodes) to users, and there are costs to exposing it as well. Within the problem space of running web services, my experience is that on balance it is better not to expose these details to users. I can see from your comments, Jeremy, that you are interested in high-frequency trading (HFT). With web services, latency is measured in milliseconds. I've heard that for HFT, it is measured in microseconds or even nanoseconds. For very short deadlines, like with HFT, I see how it is important to expose these details. So, I think the question is whether Kubernetes is a good fit for HFT, or if it can be adapted to be a good fit for HFT without compromising the web services use case. I'm not sure what the answer to this question is.
Some examples to illustrate my previous comment: say your process is sharing a CPU core with another process (time-slicing), and that other process occasionally evicts all your data from L1 cache when it runs. Then the next time you run, you have to refill your L1. It takes around ten microseconds to refill the L1. In the web services case, 10us << 1ms, so you don't care. In the HFT case, 10us >> 1us, so you totally care. Even if the processes are sharing the L1 cache nicely, Linux context switch time can be several microseconds (ref). Again, this is too long for HFT, but not too long for web services. So, for short deadlines, like in HFT, you need to allocate specific cores to specific containers to avoid context switch overheads and L1 cache interference. But for web serving applications, you need to allow fraction-of-a-core requests to get better cluster utilization (see the Borg paper, section 5.4). Mixing specific core allocations with fraction-of-a-core requests is tricky. Doing so while also considering NUMA is trickier. Also, allocating entire cores to containers hinders resource reclamation (ibid, section 5.5).
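A minimal sketch (mine, not from the thread) of the fraction-of-a-core model referenced here: Kubernetes expresses CPU requests in millicores, so a web service can ask for a quarter of a core while a batch worker asks for two whole cores. The pod names, images, and values are illustrative only, and as the comment above notes, neither form pins containers to specific physical CPUs.

```sh
# Sketch: fractional-core vs whole-core CPU requests in today's resource model.
# "250m" means 250 millicores (a quarter of a core); "2" means two full cores.
# Neither form pins the container to particular CPUs -- that is the gap this
# issue is about.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: fractional-web            # hypothetical name
spec:
  containers:
  - name: web
    image: nginx
    resources:
      requests:
        cpu: 250m                 # time-sliced via CFS shares/quota
      limits:
        cpu: 500m
---
apiVersion: v1
kind: Pod
metadata:
  name: whole-core-batch          # hypothetical name
spec:
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "2"                  # two full cores' worth of CPU time, still not pinned
      limits:
        cpu: "2"
EOF
```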
I agree with Eric (not surprising since we're drawing on mostly the same experiences from Borg). It's not that these features are not useful, it's that they are an attractive nuisance. People see them and want to tweak them. Once a significant number of people use them it is MUCH harder to claim them for automated use (impossible in some cases) and we lose out on more globally optimal designs. This isn't even hypothetical - we know pretty clearly what we want to build here, we just need to get a hundred other things lined up before it makes sense. So the question becomes: should we enable control of a knob we know we want to take back later, and if so, how? My feeling is that IF we do it, we should either do it very coarsely (reserved whole cores or even reserved LLCs) or we should do it in a way that is very clearly stepping outside the normal bounds (i.e. you break it, you buy it). It's DEFINITELY not as simple as adding a couple of fields to the API.
For a concrete example: we could do this through the proposed opaque extensions. Tell docker what you want, but you are very clearly outside the supported sphere. Or we could make opaque counted resources for things like LLC0 and LLC1 (with a count of 1) so you could schedule and ask for 1 instance of LLC0 - clearly you know what your platform architecture is in that case. There are a lot of design avenues I might find acceptable.
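To illustrate the counted-resource idea with a rough sketch of my own (not an API that existed when this comment was written): a node advertises one unit of a resource standing in for LLC0, and a pod requests that unit, so at most one such pod can land on the node. The resource name example.com/llc0, the node-name placeholder, and the pod are all hypothetical.

```sh
# Sketch of the counted-resource idea: the node advertises one unit of "LLC0",
# and a pod that requests it gets scheduled only where that unit is free.
# The resource name example.com/llc0 is made up for illustration.

# Advertise capacity 1 of the resource on a node (via the API server proxy).
kubectl proxy &
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/example.com~1llc0", "value": "1"}]' \
  http://localhost:8001/api/v1/nodes/<node-name>/status

# Request exactly one instance of LLC0; the scheduler treats it like any other
# counted resource, so only one such pod fits per node.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: llc0-claimer               # hypothetical
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        example.com/llc0: "1"
      limits:
        example.com/llc0: "1"
EOF
```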
I understand it's unavoidable to at least consider the implications of this feature, but let's walk this back to a higher level. Eric asked: "I think the question is whether Kubernetes is a good fit for HFT, or if it can be adapted to be a good fit for HFT without compromising the web services use case." Indeed this is the root question. I have mentioned HFT in other github issues, but was careful not to bring it up here because it's too focused on a particular use case. I've seen others express interest, in various github issues, in Kube supporting performance-sensitive pods and workloads. One that was brought up is NFV, or network function virtualization. NFV is remarkably similar to HFT in terms of what it will need from Kube (I'd argue they're basically the same). Another is HPC workloads (which I alluded to with the comment about RDMA), again vastly different from web services. All of these workloads and industry verticals stand to benefit significantly from the development-methodology efficiencies of container-based workflows and the related ecosystems. However -- without proper support from the orchestration system, it's likely that those industries would need to custom-build in-house solutions (again). I've read the Borg paper several times (thanks for releasing it), and so am somewhat aware of the background decisions/experiments and Google's business factors that were involved in the genesis of Borg. However, I just wanted to point out, as I'm quite sure you're aware, that there isn't always a 1:1 mapping between those experiences and other industries. Support for performance-sensitive pods would broaden the applicability of Kube beyond web services -- and this is the philosophical question to be answered.
Any thoughts, guys?
I think we can support cases like HFT, but it's going to require a ...

There is actually a cost to running apps under a system like Borg or ...

Philosophically there is no objection, I think, just a desire to tread ...
Agreed with your point, Jeremy, that the Borg experience does not necessarily map to experience in other industries. I've long known that there is an impedance mismatch when running some types of scientific computing workloads on a Borg-style system. The HFT and NFV cases are new to me, but I can see where, if you were designing a system specifically for those cases, you might do some things differently. We've got to first make Kubernetes not just good but awesome for the use cases it was created for (web serving, web analytics, and map-reduce-type batch). Then we can double back and see what can be done to optimize it for cases like NFV and HFT. I think a bunch of us have ideas on how to support those cases better, but doing a deep dive on those right now could draw attention away from the current focus.
/cc @nqn @ConnorDoyle
I am using containers with an HFT-type workload, generically termed "Fast Data". I like this notion of Performance Sensitive Pods. I'm sharing a concrete use case to help with the design of this. My underlying setup involves common practices for these workloads (though often not containerized):
Currently, I manually manage all of this with configuration (e.g. which containers run on which cores on which hosts) and schedule it with cron. It's a pain and doesn't scale well at all. But performance-wise it is awesome. Since I don't currently use Kubernetes for these workloads, I don't use Pods. I do set up multiple containers that should be placed near each other (same NUMA node, or even sharing a socket), e.g. a network-heavy process reading packets, processing them, then feeding the results to a Redis process on another core. At the simplest level, I'd like Kubernetes to know that I have:
and that I'd like to schedule a Pod that needs N cores on a compatible node.
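A rough sketch, using only mechanisms available in stock Kubernetes, of how the "N cores on a compatible node" part might be approximated today: label nodes with their hardware traits and combine a nodeSelector with a whole-number CPU request. The label keys, node name, and pod below are invented for illustration, and nothing here pins the containers to particular cores.

```sh
# Sketch: steer a pod that needs N whole cores onto nodes labeled with the
# right hardware traits. The label keys (example.com/nic, example.com/numa-nodes)
# are hypothetical; labeling could be manual or done by a discovery agent.
kubectl label node worker-1 example.com/nic=solarflare example.com/numa-nodes=2

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: feed-handler               # hypothetical
spec:
  nodeSelector:
    example.com/nic: solarflare    # only land on nodes with the right NIC
  containers:
  - name: handler
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "4"                   # N = 4 whole cores' worth of CPU time
      limits:
        cpu: "4"                   # note: still no cpuset pinning at this point
EOF
```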
Thanks very much for that -- I've seen your github repo. We are designing for exactly your use case and more. We have a NUMA and cpuset prototype working internally and have shared our design goals with Solarflare engineering. I was planning to discuss this in more depth at the KubeCon developer conference. So you have some background: I represent RH in the stacresearch.com consortia for tuning for HFT and exchange-side workloads, and have written several whitepapers for Red Hat on tuning RHEL for HFT using Solarflare adapters. We should be able to build a kick-ass system!
Evan - thanks for the details! Given what you described about your workload (multiple containers that ...), for this style of workload, do you prefer a model:
In general, I am wondering if there is simplification possible at the Pod ...
I'm particularly interested in the CAREFUL evolution of this at the API ...
@thockin -- agreed. Hope to chat more about it at KubeCon.
@derekwaynecarr Sorry for the delay in replying; I've re-written this a few times and still feel like I don't know enough to comment well, and I don't want to bike-shed. So I'll try to document more of what I'm doing now and map it to the two choices you gave (which did help me think about this). For the case I wrote about earlier, I would have to go with multiple pods, as my containers have different lifetimes. The datastore (Redis) runs round the clock, whereas the feed handlers / data processors are restarted overnight. Those are in-house programs, so I could change that. What would be important here is that I get accelerated TCP loopback, as it makes a difference for me. I also have single-process, multi-threaded services that are configured to take a set of isolated cores and non-isolated cores. The isolated cores are devoted to ripping multicast packets off the network, performing calculations and transformations on them, and then storing the results in thread-safe/concurrent data structures. The non-isolated cores have HTTP worker threads (via Proxygen) assigned to them; these take client requests and access the data structures to respond. I've also thought about adopting gRPC workers in a similar fashion. I don't do this currently, but if I had shared-memory multi-process concurrent data structures, then the single-pod approach would be appropriate. Another common tuning suggestion that I forgot to mention before is IRQ affinity. Unfortunately, I won't be at KubeCon, but am happy to connect in ways other than an issue thread.
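For readers unfamiliar with this style of tuning, here is a minimal sketch of the host-level steps being described (isolated cores, thread pinning, IRQ affinity), assuming a dedicated machine administered outside Kubernetes; the binary names, core numbers, and IRQ number are placeholders.

```sh
# Sketch of the manual tuning described above (outside Kubernetes).
# Binary names, core numbers, and the IRQ number are placeholders.

# 1. Reserve cores 2-5 from the general scheduler via the kernel command line
#    (requires a reboot), e.g. add to GRUB_CMDLINE_LINUX:
#      isolcpus=2-5

# 2. Pin the packet-processing threads to the isolated cores.
taskset -c 2-5 ./feed_handler &

# 3. Keep HTTP worker threads on the non-isolated cores.
taskset -c 6-7 ./http_frontend &

# 4. Steer the NIC's interrupt to a core near the packet-processing threads
#    (IRQ 42 is a placeholder; the mask 0x04 selects CPU 2).
echo 04 > /proc/irq/42/smp_affinity
```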
Please consider this presentation about "cpushare vs. cpuset vs. JRE".
Any updates on this topic? What were the decisions after KubeCon?
The feedback was that we need to prototype using side-car techniques, prove out the ideas, and then work to get them built into Kube. This means using the node-feature-discovery (NFD) pod for hardware discovery. NFD feeds information into Opaque Integer Resources (OIR). We extend the Kube API using ThirdPartyResources such that a pod manifest can request those resources as specified by OIR.

A user would then specify numa=true and (optionally) numa_node=1. The numa=true portion is handled by the kube scheduler: because we have OIR upstream, the scheduler can do the "fitting" aspect. The numa_node=1 decision would be a "node-local decision", IOW handled by the Kubelet. If it cannot fulfill the request, the pod would fail scheduling and the Kube scheduler would try another node. As for the mechanics between the Kubelet and the NFD pod for the node-local decision, I don't think we reached precise consensus on that, but the thought was that NFD would provide feedback to the Kubelet about the NUMA node. We think we'd need a small kubelet change to "shell out" to NFD for node-local decisions. Long term we hope to graduate this into proper Kube objects rather than TPR, while formalizing the NFD (aka node-agent) pod technique along with several other supporting features we need to complete the design (I'm sure you understand that cpuset is a small, yet critical, part of the big picture).
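To make the proposed flow more concrete, a sketch of what such a pod request might have looked like under that prototype. The annotation keys, the OIR-style resource name, and the NFD label are hypothetical placeholders for the sake of illustration, not an agreed-upon API.

```sh
# Sketch of the prototype flow: NFD labels the node, capacity is advertised as
# an opaque/counted resource, and the pod asks for it plus (hypothetical) NUMA
# hints. None of these keys were a settled API at the time.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: numa-pinned-app                           # hypothetical
  annotations:
    example.com/numa: "true"                      # "fitting" handled by the scheduler via OIR
    example.com/numa-node: "1"                    # node-local decision, handled by the kubelet
spec:
  nodeSelector:
    feature.node.kubernetes.io/cpu-numa: "true"   # NFD-style label, illustrative
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        example.com/exclusive-cores: "4"          # hypothetical counted resource fed by NFD
      limits:
        example.com/exclusive-cores: "4"
EOF
```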
I just discovered that Docker doesn't honor ...

I'm pretty surprised by this. That makes it "dangerous" to combine pinned workloads (either bare or via ...).

Basically you need to specify ...
/close

CPU manager docs: https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/

If folks don't agree this thread has run its course, please re-open.
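For anyone arriving here later, a minimal sketch of the CPU Manager approach the docs above describe: with the kubelet's static policy enabled, a Guaranteed-QoS pod that requests an integer number of CPUs is granted exclusive cores. The pod name and values are illustrative.

```sh
# Sketch: with the kubelet running the static CPU Manager policy
# (e.g. --cpu-manager-policy=static plus a CPU reservation such as
# --kube-reserved / --system-reserved), a Guaranteed pod requesting a whole
# number of CPUs is pinned to exclusive cores via cpusets.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: exclusive-cores-demo        # hypothetical
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "2"                    # integer CPU count
        memory: 1Gi
      limits:
        cpu: "2"                    # requests == limits => Guaranteed QoS
        memory: 1Gi
EOF
```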
Docker now supports `cpuset-cpus` and `cpuset-mem` arguments when running a container, to control which cpus a container can execute on. We have been asked by some users if we plan to support this feature in Kubernetes.

https://docs.docker.com/reference/run/#runtime-constraints-on-cpu-and-memory
Opening an issue to gather feedback on whether it's desired or not.
@erictune @dchen1107
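For context, a quick sketch of the Docker flags under discussion (note that current Docker releases spell the memory-node flag `--cpuset-mems`); the image and values are just examples.

```sh
# Sketch of the Docker flags in question: pin the container to CPUs 0-3 and
# restrict memory allocation to NUMA node 0. Image and values are examples.
docker run -d \
  --cpuset-cpus="0-3" \
  --cpuset-mems="0" \
  nginx
```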