by toast0 130 days ago
Retry storms are "easy": exponential backoff with jitter, like what Ethernet on shared media has been doing since the 80s.
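
For anyone who hasn't written one, here is a minimal sketch of backoff with jitter in Python. It uses the "full jitter" variant (sleep a random amount between zero and an exponentially growing ceiling), which is one common choice rather than the only one. The function name and the `base`, `cap`, and `attempts` defaults are made up for illustration.

```python
import random

def backoff_delays(base=0.1, cap=10.0, attempts=6, rng=random.random):
    """Yield one sleep duration (seconds) per retry attempt.

    base, cap, and attempts are illustrative defaults, not recommendations.
    rng returns a float in [0, 1); it is injectable so tests can be deterministic.
    """
    for attempt in range(attempts):
        # The ceiling doubles on every attempt, but never exceeds cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling) so clients spread out.
        yield rng() * ceiling
```

A caller would sleep for each yielded delay between attempts and give up once the generator is exhausted. The randomization is what keeps clients that failed at the same moment from retrying at the same moment.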

If that's not enough to come back from an outage, you need to put in load shedding and/or back pressure. There's no sense accepting all the requests and then not servicing any in time.

You want to accept and work on the requests that are likely to succeed within reasonable latency bounds, and drop the rest. Be careful, though: an instant error may feed back into the retry storm. Sometimes it's better if such errors come after a delay, so that the client is stuck waiting (back pressure).
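
A rough sketch of that admission decision in Python, assuming the server can see its queue depth and has a measured average service time. The `Shedder` class and all of its fields are hypothetical names for illustration, and the wait estimate is deliberately crude.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Shedder:
    workers: int            # assumed: parallel workers draining the queue
    avg_service_s: float    # assumed: measured mean time to serve one request
    deadline_s: float       # latency bound within which a reply is still useful
    reject_delay_s: float   # how long to hold a rejection before sending it

    def estimated_wait(self, queued: int) -> float:
        # Crude estimate: queued requests are spread evenly across workers.
        return queued * self.avg_service_s / self.workers

    def admit(self, queued: int) -> Tuple[bool, float]:
        """Return (accept, seconds to wait before sending the error reply)."""
        finish = self.estimated_wait(queued) + self.avg_service_s
        if finish <= self.deadline_s:
            return True, 0.0
        # Shed the request, but delay the error so the client stays blocked
        # instead of retrying immediately.
        return False, self.reject_delay_s
```

For example, `Shedder(workers=4, avg_service_s=0.05, deadline_s=0.5, reject_delay_s=0.2).admit(80)` estimates about a second of queueing and returns a rejection to be sent after 0.2 s. Held rejections still occupy a connection each, so a real server would probably want to cap how many it delays at once.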

1 comment

Agree backoff+jitter is table stakes, and load shedding/backpressure is necessary under sustained overload. The tricky cases I’m digging into are shared rate limits (429s) and many concurrent clients/agents where local backoff isn’t coordinated and you still get herds after partial outages. Curious what patterns you’ve seen work well for coordinating retries/fairness across tenants or API keys?