by dap
940 days ago
Thanks for this -- it's really great! One thing I noticed is that the post is very first-principles right up to where it reaches exponential backoff. At that point, it quickly jumps to "and here's exponential backoff and here's some good parameters". But I've worked on a lot of systems that got those wrong, in both directions: caps that were too short to give the underlying system time to recover, and caps so long that even when the servers _did_ recover, clients weren't going to try again for way too long (e.g., 2 days). It'd be neat to have another section or two exploring those tradeoffs.

I really want one of these visual explorations for the idea of margin. Concretely: it's common to have systems at, say, 88% CPU utilization that appear to be working great. Then you ramp them up to like 92% and start seeing latency bubbles of multiple seconds or even tens of seconds. We tend to think of that idle time as waste, but it's essential for surviving transient blips in load. I increasingly feel like this concept is really fundamental and ought to be taught in high school, because it applies in so many places (e.g., emergency funds, in the realm of personal finance).
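A rough way to see why a few points of utilization matter so much is a back-of-the-envelope sketch, assuming an M/M/1 queue (a textbook model, not something from the post) for steady-state latency, plus a simple fluid approximation for how long spare capacity takes to drain a backlog left by a load blip. The service time, capacity, and burst size below are hypothetical numbers chosen only to show the shape of the curve; real systems won't match these figures exactly.

```python
def mm1_mean_response_time(service_time: float, utilization: float) -> float:
    """Mean time in system (waiting + service) for an M/M/1 queue: S / (1 - rho).

    Assumes Poisson arrivals and exponential service times (textbook model).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


def backlog_drain_time(backlog: float, capacity_rps: float, utilization: float) -> float:
    """Seconds to work off a backlog using only the idle headroom.

    Fluid approximation: steady load consumes utilization * capacity, so
    only capacity * (1 - utilization) requests/sec are left for the backlog.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return backlog / (capacity_rps * (1.0 - utilization))


if __name__ == "__main__":
    service_time_ms = 10.0   # hypothetical per-request service time
    capacity_rps = 1000.0    # hypothetical server capacity
    burst = 5000.0           # hypothetical extra requests from a transient blip
    for rho in (0.50, 0.80, 0.88, 0.92, 0.96, 0.99):
        latency = mm1_mean_response_time(service_time_ms, rho)
        drain = backlog_drain_time(burst, capacity_rps, rho)
        print(f"{rho:.0%} busy: mean latency {latency:7.1f} ms, "
              f"blip drains in {drain:7.1f} s")
```

Under these assumptions, going from 88% to 92% cuts the headroom from 12% to 8% of capacity, so the same blip takes 1.5x as long to clear, and both curves blow up as utilization approaches 100%. That nonlinearity is the "margin" point: the idle capacity is what absorbs the blip.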