by mherdeg 1043 days ago

One of the special treats of operating a large distributed system is getting to watch the graph of requests-versus-time as you do stuff, like enabling HTTP flow-control headers or tweaking load-balancing algorithms or whatever else.

(1) You learn a ton about which production clients actually respect your instructions, and which ones creatively misinterpret them. You learn a lot faster than you would by stepping through the code.

(2) You learn how other systems behave with flow control in place. If you can, you make it so the default behavior is safe for users. This makes it safe for people to practice pushing the button, even when things aren't overloaded, so they remember how to do it and don't feel uncomfortable doing it during a high-load event.

(3) You get to see some really cool shapes -- sinusoids, cliffs, plateaus -- that show you how traffic goes away and how much of it comes back. (You probably learn, early on, to tell different clients to retry after different time windows, and how this shapes the curve.)
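To make the "different retry windows for different clients" idea concrete, here is a minimal sketch in Python. It assumes the flow-control signal is a 503 with the standard HTTP `Retry-After` header; the client classes, the window values, and the names `RETRY_WINDOWS`, `retry_after_seconds`, and `overload_response` are all made up for illustration and not from any particular system.

```python
import random
from typing import Dict, Tuple

# Hypothetical per-client retry windows in seconds (min, max). The classes and
# numbers are illustrative assumptions, not taken from any real system.
RETRY_WINDOWS: Dict[str, Tuple[int, int]] = {
    "interactive": (1, 5),
    "batch": (30, 120),
}
DEFAULT_WINDOW: Tuple[int, int] = (5, 30)


def retry_after_seconds(client_class: str, rng: random.Random) -> int:
    """Pick a Retry-After value uniformly inside the client's window."""
    low, high = RETRY_WINDOWS.get(client_class, DEFAULT_WINDOW)
    return rng.randint(low, high)  # randint is inclusive on both ends


def overload_response(client_class: str, rng: random.Random) -> Tuple[int, Dict[str, str]]:
    """Build a (status, headers) pair for a request that is being shed."""
    return 503, {"Retry-After": str(retry_after_seconds(client_class, rng))}
```

Spreading each class over a window rather than a single value is one plausible way to avoid everybody coming back in the same second; wider windows for batch clients would flatten the returning wave further.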

Also really satisfying has been observing how a system behaves under load and making a tens-of-lines-of-code change that hugely affects the safety of request handling. We stabilized a large system that was showing bad behavior under load just by teaching it a few things, like:

(1) Put health checks in a different thread pool than other request handling, so you can always satisfy them quickly

(2) If you're going to return fast failures to certain categories of request when overloaded, insert some debouncing. Evaluate "I am overloaded" over a time window -- like "I have been continuously overloaded for the past 300ms" -- and a lot of signals smooth out in a way that ends up feeling better to users.
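Both changes really are tens of lines. A rough sketch in Python, not the code we actually shipped: `OverloadDebouncer`, `submit`, the `/healthz` path, and the pool sizes are hypothetical names and values, and the 300ms window is just the example figure from above.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Optional


class OverloadDebouncer:
    """Reports overload only after the raw signal has held for `window_s`."""

    def __init__(self, window_s: float = 0.3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window_s = window_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._overloaded_since: Optional[float] = None

    def update(self, raw_overloaded: bool) -> bool:
        """Feed the latest raw sample; return the debounced decision."""
        if not raw_overloaded:
            self._overloaded_since = None  # any healthy sample resets the streak
            return False
        now = self._clock()
        if self._overloaded_since is None:
            self._overloaded_since = now  # start of a new overloaded streak
        return now - self._overloaded_since >= self._window_s


# Separate pools, so a saturated work pool cannot starve health checks.
# Sizes are arbitrary placeholders.
health_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="health")
work_pool = ThreadPoolExecutor(max_workers=16, thread_name_prefix="work")


def submit(path: str, handler: Callable[[], str]) -> "Future[str]":
    """Route health checks to their own pool; everything else shares work_pool."""
    pool = health_pool if path == "/healthz" else work_pool
    return pool.submit(handler)
```

The request path would call `update()` with whatever raw signal you already have (queue depth, in-flight count) and only return fast failures when it comes back true. Resetting on a single healthy sample is the strictest reading of "continuously overloaded"; a real system might tolerate brief dips instead.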

This has always been a fun space to work in.