There's a subtle insight that could be added to the post if you consider it worth it. It's actually there already, but difficult to notice: clients in your simulation have an absolute maximum number of retries.
I noticed this mid-read: in one of the animations with 28 clients, they would hammer the server and then suddenly go into a wait state for no apparent reason.
Later in the final animation with debug mode enabled, the reason becomes apparent for those who click on the Controls button:
Retry Strategy > Max Attempts = 10
It makes sense, because in the worst case when everything goes wrong, a client should reach a point where it desists and just aborts with a "service not available" error.
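For anyone who wants to see that behaviour outside the animation, here is a minimal sketch of a client that gives up after a fixed number of attempts. The names `call_with_max_attempts`, `ServiceUnavailable`, and `base_delay` are made up for illustration, and the default of 10 just mirrors the Max Attempts value shown in the Controls panel. It is not the post's actual implementation.

```python
import time


class ServiceUnavailable(Exception):
    """Raised once the client has used up all of its attempts (hypothetical name)."""


def call_with_max_attempts(request, max_attempts=10, base_delay=0.1, sleep=time.sleep):
    """Call request() until it succeeds or max_attempts calls have failed.

    `request` is any zero-argument callable that raises on failure.
    `sleep` is injectable so the function can be exercised without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except Exception:
            if attempt == max_attempts:
                # Worst case: everything went wrong, so stop and surface an error.
                raise ServiceUnavailable("service not available") from None
            # Wait twice as long after each failure: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Catching every `Exception` is a simplification; a real client would likely retry only on errors it knows to be transient.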
Exponential retries can effectively have a maximum number of requests if the gap between retries gets long enough quickly enough. In practice, the user will refresh or close the page if things look broken for too long.
Unbounded exponential backoff is a horrible experience, and improves basically nothing.
If it makes sense to completely fail the request, do it before the waiting becomes noticeable. If it's something that can't just fail, set a maximum waiting time and add jitter.
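A rough sketch of "set a maximum waiting time and add jitter", assuming the common "full jitter" variant where the delay is drawn uniformly between zero and the capped exponential value. The function name `backoff_delay` and the `base` and `cap` defaults are placeholders, not recommended values.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (1-based).

    The exponential term doubles on every attempt but never exceeds `cap`;
    the result is then scaled by a random factor in [0, 1) so that clients
    that failed together do not all retry at the same moment.
    """
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return rng() * ceiling
```

With these placeholder values the ceiling reaches the cap on the seventh retry, so no client ever waits longer than `cap` seconds between attempts, which also bounds how long a client stays away after the server has recovered.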
I think decoupling retry logic from the “there’s something wrong” UI ends up being a better experience than tying the UI state to the details of network retries. (For one thing, it gives you a chance to fix the “everything is broken” UI without any action on the user’s part.)
One thing I noticed is that the post is very first-principles right up to where it reaches exponential backoff. At that point, it quickly jumps to "and here's exponential backoff and here's some good parameters". But I've worked on a lot of systems that got those wrong. In both directions: too-short caps that were insufficient for the underlying system to recover and too-long caps so that even when the servers _did_ recover, clients weren't even going to try again for way too long (e.g., 2 days). It'd be neat to have another section or two exploring those tradeoffs.
I really want one of these visual explorations for the idea of margin. Concretely: it's common to have systems at, say, 88% CPU utilization that appear to be working great. Then you ramp them up to like 92% and start seeing latency bubbles of multiple seconds or even tens of seconds. We tend to think of that idle time as waste, but it's essential for surviving transient blips in load. I increasingly feel like this concept is really fundamental and ought to be taught in like high school because it applies so many places (e.g., emergency funds, in the realm of personal finance).
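As a back-of-the-envelope illustration of why those last few percent matter, here is the textbook M/M/1 queue formula for mean time in the system, `service_time / (1 - utilization)`. Treating a real server as M/M/1 (one worker, random arrivals, exponentially distributed service times) is an assumption made purely for illustration, and the function name is made up.

```python
def mean_latency(service_time, utilization):
    """Mean time a request spends queued plus being served, in an M/M/1 queue.

    service_time: average time to serve one request when the server is idle.
    utilization:  fraction of time the server is busy, in [0, 1).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1 - utilization)


if __name__ == "__main__":
    # Print the slowdown factor relative to an idle server.
    for rho in (0.50, 0.88, 0.92, 0.99):
        print(f"{rho:.0%} busy: {mean_latency(1.0, rho):.1f}x the idle latency")
```

Under this model, going from 88% to 92% moves the average from roughly 8.3x to 12.5x the idle latency, and the curve keeps steepening toward 100%. It only describes steady-state averages, so it understates the transient spikes described above, but it shows why the remaining headroom is not simply waste.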
What technology did you use for the animations? I've a bunch of itches I'd like to scratch that would be improved by having some canvas animated explainers or UI but I never clicked with anything. D3 back in the day.
A rudimentary look in the source code showed a <traffic-simulation/> element but I'm not up to date enough with web standards to guess where to look for that in your JS bundle to guess at the framework!
I've been thinking about creating a separate repo to house the source code of posts I've finished so people can see it. I don't like all the bundling and minification but sadly it serves a very real purpose to the end user experience (faster load speeds on slow connections).
Until then feel free to email me (you'll find my address at the bottom of my site) and I'd be happy to share a zip of this post with you.