by Schwan 2122 days ago
Is this really right?

"The danger of not setting a CPU limit is that containers running in the node could exhaust all CPU available."

My assumptions have been:
1. The CPU request tells you the MINIMUM CPU a pod always gets, regardless of how much other pods use.
2. On GKE you can't request 100% of the CPU, because Google reserves some CPU for the node.
3. If you have hard limits, your cluster utilisation will be bad, so we remove CPU limits for this reason.

3 comments

The reason a container with no limit can exhaust CPU is that Kubernetes CPU requests map to the cpushares accounting system, and CPU limits map to the Completely Fair Scheduler's cpuquota system. The cpushares system divides a core into 1024 shares and guarantees a process gets the number of shares it reserves, but it does not stop the process from taking more if other processes aren't consuming them. The cpuquota system divides CPU time into periods of 100,000 microseconds (100 ms) by default, and hard-limits a process to the number of microseconds per period it requests. So if you don't set limits you're only using the cpushares system, and are free to take up as much idle CPU as you can grab.
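To make the arithmetic above concrete, here is a rough sketch of how a CPU value in millicores would translate into shares and quota. The function names and constants are my own for illustration, not the kubelet's actual code, and it assumes positive millicore values and the default 100 ms period:

```python
from typing import Optional

SHARES_PER_CPU = 1024      # one full core is 1024 shares, as described above
CFS_PERIOD_US = 100_000    # default CFS period: 100 ms


def millicpu_to_shares(request_millicpu: int) -> int:
    """Hypothetical helper: CPU request in millicores -> cpushares weight."""
    return request_millicpu * SHARES_PER_CPU // 1000


def millicpu_to_quota_us(limit_millicpu: Optional[int],
                         period_us: int = CFS_PERIOD_US) -> Optional[int]:
    """Hypothetical helper: CPU limit in millicores -> microseconds per period.

    Returns None when no limit is set, meaning no quota applies at all.
    """
    if limit_millicpu is None:
        return None
    return limit_millicpu * period_us // 1000


if __name__ == "__main__":
    # A container with request 500m and limit 500m.
    print(millicpu_to_shares(500), millicpu_to_quota_us(500))
    # The same request with no limit: shares only, no quota.
    print(millicpu_to_shares(500), millicpu_to_quota_us(None))
```

With a 500m request and a 500m limit this gives 512 shares and 50,000 microseconds per period; drop the limit and only the 512 shares remain, which is the "free to grab idle CPU" case.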
1 is correct, 2 is partially correct, and 3 is not correct.

It is absolutely true that Kubernetes will reserve the amount of CPU you request, although it will also allow you to exceed that request if you attempt to and there is free CPU time to service you. 2 is correct insofar as Google runs daemonsets on GKE which themselves have CPU requests and limits, and thus there will never be a node which has 100% CPU free for you to request. 3 is simply incorrect: it may be true that for some combinations of nodes and workloads the Kubernetes scheduler cannot bin-pack efficiently, but for large clusters with diverse workloads this should not be a problem.

Excluding kernel bugs, CPU limits just provide an upper bound on burst capacity. That controls oversubscription of CPU on a node. As with any other kind of oversubscription of a resource based on variable demand, there is a tradeoff. Letting a pod burst over its request is unreliable for that pod, since the spare CPU may not be there next time, and can also affect neighboring pods. Whether that improves your cluster efficiency or introduces intolerably high variability in service latency and throughput depends on your mix of workloads and how the scheduler distributes your various pods.
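The burst tradeoff can be illustrated with a toy model. This is not how the kernel scheduler is implemented; it is a simplified weighted split I am assuming for illustration, where each pod gets CPU in proportion to its shares, never more than it wants, and never more than its limit if one is set. The function and pod names are hypothetical:

```python
def allocate_cpu(capacity, pods):
    """Toy weighted split of `capacity` cores among pods.

    pods maps name -> (shares, demand_cores, limit_cores or None).
    Assumes every pod has a positive share count.
    """
    # A pod never uses more than it wants, or more than its limit.
    cap = {}
    for name, (shares, demand, limit) in pods.items():
        cap[name] = demand if limit is None else min(demand, limit)

    alloc = {name: 0.0 for name in pods}
    active = {name for name in pods if cap[name] > 0}
    remaining = float(capacity)

    while active and remaining > 0:
        total_shares = sum(pods[n][0] for n in active)
        fair = {n: remaining * pods[n][0] / total_shares for n in active}
        # Pods whose cap fits inside their fair slice are fully satisfied.
        satisfied = {n for n in active if cap[n] <= fair[n]}
        if not satisfied:
            # Everyone wants more than their slice: split by shares and stop.
            for n in active:
                alloc[n] = fair[n]
            break
        for n in satisfied:
            alloc[n] = cap[n]
            remaining -= cap[n]
        active -= satisfied
    return alloc


if __name__ == "__main__":
    # 4-core node; both pods request 1 CPU (1024 shares).
    print(allocate_cpu(4, {"busy": (1024, 4.0, None), "quiet": (1024, 0.5, None)}))
    print(allocate_cpu(4, {"busy": (1024, 4.0, 1.0), "quiet": (1024, 0.5, None)}))
```

In this model the unlimited "busy" pod soaks up the 3.5 cores that "quiet" leaves idle, while a 1-core limit holds it at 1.0 and leaves the rest of the node unused. That is the utilisation versus predictability tradeoff in miniature.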

Buffer's solution of having different flavors of node, onto which mutually compatible workloads are scheduled in isolation from incompatible ones, is a very reasonable thing to do, even if this particular case is a bit of a head-scratcher.