by seminatl 2400 days ago
This does not really have anything to do with either kubernetes or networks. If your computer is busy, it won't be able to process packets. Accessing certain kernel stats via /proc, /sys, or other special files may be really expensive. For example, reading /proc/pid/smaps of a running mysqld takes 2 seconds on a computer I happen to have on hand. Sometimes, when you have many cores, it is expensive to produce some of the fields of /proc/pid/stat because the kernel has to visit numerous per-cpu data structures. /proc/pid/statm is better for this reason, if it contains what you are looking for.

TL;DR: reading kernel stats can take a long time and cost a lot of CPU cycles. It costs more with more containers, and more on bigger machines.
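
To check this on your own machine, something like the following rough sketch would do. It times a full read of a few /proc files for one process and reports the best of several runs. The function names (time_read, compare_proc_files) are made up for illustration, it assumes Linux with /proc mounted, and it measures wall-clock time only, so treat the numbers as a rough indication rather than a proper benchmark.

```python
import time


def time_read(path, repeats=5):
    """Read `path` fully `repeats` times.

    Returns (best_seconds, size_in_bytes_of_last_read).
    """
    best = float("inf")
    size = 0
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as f:
            data = f.read()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        size = len(data)
    return best, size


def compare_proc_files(pid="self", names=("stat", "statm", "smaps")):
    """Time reads of /proc/<pid>/<name> for each name.

    A file that cannot be read (no permission, process gone, not Linux)
    maps to None instead of raising.
    """
    results = {}
    for name in names:
        path = f"/proc/{pid}/{name}"
        try:
            results[name] = time_read(path)
        except OSError:
            results[name] = None
    return results


if __name__ == "__main__":
    # "self" is the Python process itself; pass a real pid (as a string
    # or int) to look at something bigger, such as a database server.
    for name, result in compare_proc_files().items():
        if result is None:
            print(f"{name}: not readable")
        else:
            seconds, size = result
            print(f"{name}: {seconds * 1e6:.0f} us for {size} bytes")
```

Pointing it at a process with a large address space should make any gap between smaps and statm easier to see; against the small Python process itself the difference may be modest.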

4 comments

True, it does not have anything to do with k8s or networks, but that's the context in which this issue arose: they noticed higher network latency on a kubernetes cluster.

The value of this blog post isn't only in the "why" ("reading kernel stats can take a long time and cost a lot of CPU cycles"), but also in the "how": the way they went about finding the cause of the symptoms.

> This does not really have anything to do with either kubernetes or networks.

The article is literally about an issue that was experienced while operating Kubernetes clusters.

FTA:

> Essentially, applications running on our Kubernetes clusters would observe seemingly random latency of up to and over 100ms on connections, which would cause downstream timeouts or retries.

Sounds like a problem affecting Kubernetes to me, and an important one.

More importantly, it sounds like a non-trivial problem, and others operating Kubernetes clusters would be interested in learning how to identify it and how to track down its root cause.

It has literally nothing to do with K8s. An equally suitable title would have been "Debugging network stalls on the Intel Xeon processor" or "Debugging network stalls on planet Earth".

I feel like the title is somewhat appropriate: part of the post was about selectively removing parts of Kubernetes networking, and digging through the kernel networking stack, to troubleshoot the issue.

Simple answer really: anything with Kubernetes in the title gets more clicks.