by jrockway 1164 days ago
So this is a pretty common cascading failure scenario. Even ignoring CPU limits, if your service gets slow when it's over capacity, this will almost always happen. Latency increases to the point where liveness probes fail, causing the size of the fleet to decrease because of liveness-induced restarts, causing the other replicas to experience more load, causing them to become slow enough to fail liveness probes, and soon enough, everything is dead.
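
To make the feedback loop concrete, here's a toy model, not a measurement of any real system. It assumes load is spread evenly, that a replica fails its liveness probe whenever its share exceeds a fixed capacity, and that restarted replicas don't come back in time to help. The function name and all the numbers are made up for illustration.

```go
package main

import "fmt"

// survivors models a fleet where totalLoad is spread evenly across replicas.
// Each round, if the per-replica load exceeds capacity, one replica fails its
// liveness probe and drops out, and its load is redistributed to the rest.
// It returns how many replicas remain once the fleet stops shrinking.
// Simplification: a restarted replica never rejoins.
func survivors(replicas int, totalLoad, capacity float64) int {
	for replicas > 0 && totalLoad/float64(replicas) > capacity {
		replicas-- // liveness-induced restart removes one replica
	}
	return replicas
}

func main() {
	fmt.Println(survivors(10, 950, 100)) // 95 per replica: prints 10
	fmt.Println(survivors(9, 950, 100))  // ~105.6 per replica: prints 0
}
```

Losing a single replica out of ten is enough to take the whole fleet to zero in this model, because every restart makes the per-replica load worse for everyone who's left.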

Kubernetes can only do so much for you here. Liveness probes are designed to restart categorically broken software; for example, a combination of two requests causes no further requests to be handled. Maybe that's rare enough that a simple restart is an improvement over a replica that times out all requests directed at it. (You can unfortunately see this behavior in real-world scenarios. You can also architect your application to self-check, of course, but the common "if path == '/healthz' { response.WriteHeader(200) }" isn't this.) Readiness probes can shed load, but only by taking this replica's endpoints out of the service until things calm down, which pushes its share of the load onto the other replicas. If the system as a whole doesn't have enough capacity, then picking one replica and saying "you can rest for 5 minutes" is just going to cause the other replicas to become overloaded and the whole system to eventually fail.

There are other techniques here that work better.

Rate limiting is very common inside Big Tech; when a calling service induces too much load, it's told to simply go away via a fast path. That can prevent the thundering herd by allowing a % of requests to make progress, while other requests are rejected. Some progress is made while the system is degraded, and if there is spare capacity and a buffer, eventually the buffer is drained. (This post is too long to rant about buffering in distributed systems and what backpressure is, but if a buffer size of 1 can become full, then a buffer of any size can become full. So buffering is rarely a solution, but often the cause of outages.)

Circuit breaking is also common, where when a significant fraction of requests end with 5xx (usually a timeout), the load balancer just fast-paths a 5xx response for that replica's share of requests. This actually reduces load on the system, allowing it to process some requests instead of becoming a fleet of replicas in CrashLoopBackoff.

CPU limits are another complicating factor, but not much of one. Every piece of software runs with a CPU limit; only a finite number of CPUs can fit in your data center, or the Universe for that matter. A common problem that people run into is multithreaded software that doesn't understand that it's CPU limited. This does not cause failures, but typically induces a weird tail latency. CPU limits are enforced at discrete intervals; with a limit of 1 CPU, every 100ms you're allowed to use 100ms of CPU time. You can spend that on 1 CPU for the whole interval, but you're also allowed to use 10 CPUs for 10ms and then sit idle for 90ms. (The system will enforce this; you may want to do work on 10 CPUs, but you're going to sleep after that first 10ms burst.) Usually, your system can be architected with CPU limits in mind; for example, by setting something like GOMAXPROCS to the CPU limit instead of the number of physical CPUs, so that it can't consume the time allotted before the accounting interval ends. But these mistakes very rarely lead to cascading failure, just very confusing 99.9%-ile latency numbers when you're under load and a request spans that forced-idle interval.

Anyway, I have laid all of this foundation so I can get to my rant. There are a lot of "Kubernetes best practices" out there, and two that I have run into are that all applications must have a liveness probe, and that all applications must run at a Guaranteed QoS (and have cpu request == cpu limit != 0). These are interesting things to think about, but not a guaranteed way to enhance reliability (or lower cost). Your workload might be burstable, in which case a Burstable QoS might be exactly what you need; you trade reliability (a guarantee that all containers will be able to use a certain amount of CPU) for efficiency (you can dip into foobar service's CPU shares when barbaz needs to do a rare high-CPU activity). Liveness probes can be good too, where you have a single-threaded event loop that can get wedged accidentally, and restarting is the only way out. But, neither practice can be blindly applied to every workload that can be run in a container.