Hacker News new | ask | show | jobs
by rsanders 2117 days ago
Removing CPU limits seems like a bad idea now that there's a kernel fix. But putting that aside...

I don't understand why pods without CPU limits would cause unresponsive kubelets. For a long time now Kubernetes has allocated a slice for system services. While pods without CPU limits are allowed to burst, they are still limited to the amount of CPU allocated to kubernetes pods.

Run "systemd-cgls" on a node and you'll see two toplevel slices: kubepods and system. The kubelet process lives within the system slice.

If you run "kubectl describe <node>" you can see the resources set aside for system processes on the node. Processes in the system slice should always have (cpu_capacity - cpu_allocatable) available to share, no matter what happens in the kubepods slice.

    Capacity:
        cpu:                         8
        ephemeral-storage:           83873772Ki
        memory:                      62907108Ki
    Allocatable:
        cpu:                         7910m
        ephemeral-storage:           76224326324
        memory:                      61890276Ki
        pods:                        58
Granted, it's not a large proportion of CPU.
4 comments

It might depend alot on the distribution and how kubernetes is started. How much CPU time to reserve for system services from the scheduler (Allocatable as you pointed out) needs to be passed to kubelet, and I think only really applies to guarenteed pods.

What I did on the distribution I work on, is tune the cgroup shares so control plane services are allocated CPU time ahead of pods (whether guarenteed, burstable, or best effort). We don't run anything as static containers, so this covers all the kube services, etcd, system services, etc.

Before this change in our distribution, IIRC, pods and control plane had equal waiting, which allowed the possibility for kubelet or other control plane services to be starved if the system was very busy.

There are also lots of other problems that can lead to kubelet bouncing between ready/not ready that we've observed which wouldn't be triggered by the limits.

Even without the bug it will have negative effect on latency and generally is not really needed for un-metered workloads (there’re posts by thockin on reddit and github that describe this in detail)

To answer your other question - I believe kops ships without system reserved by default

Can you explain how having a CPU limit set (at any level) has a negative effect on latency? That's an important factor to understand.

The arguments for allowing containers to burst makes plenty of sense to me. I do it on most of my services!

thockin's reddit post for reference: https://www.reddit.com/r/kubernetes/comments/all1vg/on_kuber...

Another interesting bit of context describing some of the non-intuitive impacts of CPU limits: https://github.com/kubernetes/kubernetes/issues/51135

Edit: added links

It’s mentioned elsewhere on this thread but essentially with cfs quota period default which is 100ms it’s really easy for multithreaded process to exhaust the quota and just sit there idle until next period. Another thing is if you have spare cycles that you presumably already paid for why not just use them?
> Removing CPU limits seems like a bad idea now that there's a kernel fix.

Actually, why? Sure those guys may starve the ones without limits but they won't starve each other just because Linux will simply time-share the processes. And for services on the critical path (what they turned it off for) that seems like correct behaviour.

I should have said that it seems like the wrong fix to the problem. But I have since learned that limits can cause excessive throttling. And of course you may want your pods to be burstable, but that would just be a question of setting appropriate limits.

Live and learn!

It's pretty simple, limits work only when everyone are using them. If you have one pod that does not enforce limits it can disrupt the entire node.
A container with a request but without a limit should be scheduled as Burstable, and it should only receive allocations in excess of its request when all other containers have had their demand <= request satisfied.

A container without either request or limit is twice-damned, and will be scheduled as BestEffort. The entire cgroup slice for all BestEffort pods is given a cpu.shares of 2 milliCPUs, and if the kernel scheduler is functioning well, no pod in there is going to disrupt the anything but other BestEffort pods with any amount of processor demand. Throw in a 64 thread busyloop and no Burstable or Guaranteed pods should notice much.

Of course that's the ideal. There is an observable difference between a process that relinquishes its scheduler slice and one that must be pre-empted. But I wouldn't call that a major disruption. Each pod will still be given its full requested share of CPU.

If that's not the case, I'd love to know!

Are you sure that BestEffort QOS do not disrupt the entire node? I remember in the past a single pod would freeze the entire VM.
I wrote a little fork+spinloop program w/100 subprocesses and deployed it with a low (100m) CPU request and no limit. It's certainly driving CPU usage to nearly all 8 of the 8 cores on the machine, but the other processes sharing the node are doing fine.

Prometheus scrapes of the kubelet have slowed down a bit, but are still under 400ms.

Prometheus scrape latency for the node kubelet has increased, but not it's still sub-500ms.

Note that this cluster (which is on EKS) does have system reserved resources.

    [root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/system.slice/cpu.shares
    1024
    [root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/kubepods/cpu.shares
    8099
    [root@ip-10-1-100-143 /]# cat /sys/fs/cgroup/cpu/user.slice/cpu.shares
    1024
Could you please elaborate on why's that so?