
by otterley 2220 days ago

I actually disagree with the first recommendation as written - specifically, not to set a CPU resource request to a small amount. It's not always as harmful as it might sound to the novice.

It's important to understand that CPU resource requests are used for scheduling and not for limiting. As the author suggests, this can be an issue when there is CPU contention, but on the other hand, it might not be. That's because memory limits are even more important than CPU requests when scheduling: most applications use far more memory as a proportion of overall host resources than CPU.

Let's take an example. Suppose we have a 64GB worker node with 8 CPUs in it. Now suppose we have a number of pods to schedule on it, each with a memory limit of 2GB and a CPU request of 1 millicore (0.001CPU). On this node, we will be able to accommodate 32 such pods.
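A rough Python sketch of that arithmetic, for illustration only. It assumes each pod's memory request equals its 2GB limit (the scheduler packs on requests) and ignores any memory or CPU reserved for the system. `max_pods` is a hypothetical helper, not part of any Kubernetes API.

```python
def max_pods(node_mem_gb, node_cpus, pod_mem_gb, pod_cpu_millicores):
    """How many identical pods fit on one node, by requests alone (simplified)."""
    by_memory = node_mem_gb // pod_mem_gb
    by_cpu = (node_cpus * 1000) // pod_cpu_millicores  # 1 CPU = 1000 millicores
    # The scarcer resource decides how many pods fit.
    return min(by_memory, by_cpu)

fits = max_pods(node_mem_gb=64, node_cpus=8, pod_mem_gb=2, pod_cpu_millicores=1)
print(fits)  # 32: memory runs out long before the CPU requests do
```

With a 1-millicore request the CPU side would allow 8000 pods, so memory is the binding constraint here.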

Now suppose one of the pods gets busy. This pod can have all the idle CPU it wants! That's because it's a request and not a limit.

Now suppose all of the pods become busy at once, so the CPUs are fully contended. The way the Linux scheduler works is that it will use the CPU request as a relative weight with respect to the other processes in the parent cgroup. It doesn't matter that the requests are small as an absolute value; what matters is their relative proportion. So if they're all 1 millicore, they will all get equal time. In this example, we have 32 pods and 8 CPUs, so under full contention, each will get 0.25 CPU.
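To make the weighting concrete, here is a minimal Python sketch. It assumes cgroup v1, where (as far as I know) Kubernetes turns a CPU request into `cpu.shares` at roughly 1024 per core with a floor of 2, and it assumes every pod has enough runnable threads to use any CPU it is given. `contended_cpu` is a hypothetical name, and the flat model ignores the QoS cgroup hierarchy that sits above the pods.

```python
def contended_cpu(requests_millicores, node_cpus):
    """CPU each pod gets when every pod wants as much CPU as it can have."""
    # Assumed conversion: 1024 shares per core, never below 2 shares.
    weights = [max(2, m * 1024 // 1000) for m in requests_millicores]
    total = sum(weights)
    # Each pod's slice is proportional to its weight, not its absolute request.
    return [node_cpus * w / total for w in weights]

per_pod = contended_cpu([1] * 32, node_cpus=8)
print(per_pod[0])  # 0.25
```

Doubling every request leaves the result unchanged, which is the point: only the ratios between pods matter.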

So when I talk to customers about resource planning, I usually recommend that they start with a low CPU request and optimize for memory consumption until their workloads dictate otherwise. Particularly greedy pods do exist, but they're not the typical case - and those that are greedy will often consume all of a worker's CPUs, in which case you might as well dedicate nodes to them rather than micromanage the situation.

3 comments

jeffbee 2220 days ago

If you ask for 0.001 CPU share, you might get it. I would advise caution. If that pod gets scheduled on a node with another pod that asks for 4 CPUs and 100MB of memory, it's not going to get any time.
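A back-of-the-envelope version of that worry, as a Python sketch. The 8-CPU node is an assumption borrowed from the example upthread, as is the cgroup v1 conversion of 1024 shares per core with a floor of 2. It also assumes both pods have enough runnable threads to saturate every CPU, which is the worst case for the small pod. `tiny_pod_cpu` is a hypothetical helper.

```python
def tiny_pod_cpu(tiny_millicores, big_millicores, node_cpus):
    """CPU left for the small pod when both pods compete for every core."""
    def shares(millicores):
        # Assumed conversion: 1024 shares per core, never below 2 shares.
        return max(2, millicores * 1024 // 1000)

    tiny, big = shares(tiny_millicores), shares(big_millicores)
    return node_cpus * tiny / (tiny + big)

tiny = tiny_pod_cpu(tiny_millicores=1, big_millicores=4000, node_cpus=8)
print(f"{tiny:.4f}")  # 0.0039
```

If the large pod cannot actually keep all 8 CPUs busy, the small pod is free to use whatever is idle, so this number is a floor under contention rather than a cap.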

otterley 2220 days ago

It depends. If the second pod requests 4 CPUs, it doesn't necessarily mean that the first pod can't use all the CPUs in the uncontended case.

A lot of this depends on policy and cooperation, which is true for any multitenant system. If the policy is that nobody requests CPU, then the behavior will be like an ordinary shared Linux server under load - the scheduler will manage it as fairly as possible. OTOH, if there are pods that are greedy and pods that are parsimonious in terms of their requests, the greedy pods will get the lion's share of the resources if they need them.

The flip side of overallocating CPU requests is cost. This value is subtracted from the available resources, making the node unavailable to do other useful work. Most of the time I see customers making the opposite mistake - overallocating CPU requests so much that their overall CPU utilization is well under 25% during peak periods.

jeffbee 2220 days ago

Most people would be thrilled to get anything close to 25% CPU util. I guess one of the big missing pieces from Borg that hasn't landed in k8s is node resource estimation. If you have a functional estimator, setting requests and limits becomes a bit less critical.

coredog64 2219 days ago

1000% agree. Former employer had a proprietary app scheduler that worked like this. We would frequently tell users to request as little CPU as possible. Extra CPU would be shared, but if you made an unreasonable request you’d never get scheduled in the shared environment.

marekaf 2219 days ago

I agree! My assumption was that you can't trust anyone, given all the greedy pods they could schedule.

The example you are describing is probably not super common, but I will try to rephrase my blog post so that it reflects this comment :)

otterley 2219 days ago

Sorry, what is not super common? With my customers I rarely see incidents due to CPU starvation of a pod in their K8S clusters.
