by lclarkmichalek 944 days ago
This still isn't what I'd call "safe". Retries are amazing at supporting clients in handling temporary issues, but horrible for helping them deal with consistently overloaded servers. While jitter & exponential backoff help with the timing, they don't reduce the overall load sent to the service.

The next step is usually local circuit breakers. The two easiest to implement are terminating the request if the error rate to the service over the last <window> is greater than x%, and terminating the request (or disabling retries) if the % of requests that are retries over the last <window> is greater than x%.

i.e. don't bother sending a request if 70% of requests have errored in the last minute, and don't bother retrying if 50% of the requests we've sent in the last minute have already been retries.
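
For illustration, the two checks above could be sketched roughly like this in Python. This is a minimal sketch, not any particular library's implementation: the class name `LocalBreaker`, its parameters, and the `min_samples` guard (which keeps a single early failure from reading as a 100% error rate) are all assumptions made for the example. The 70% and 50% thresholds and the one-minute window come from the numbers above.

```python
import time
from collections import deque


class LocalBreaker:
    """Hypothetical client-side breaker tracking a sliding window of attempts."""

    def __init__(self, window_s=60.0, max_error_rate=0.7, max_retry_rate=0.5,
                 min_samples=10, clock=time.monotonic):
        self.window_s = window_s
        self.max_error_rate = max_error_rate
        self.max_retry_rate = max_retry_rate
        self.min_samples = min_samples  # assumption: ignore tiny samples
        self.clock = clock
        self.events = deque()  # entries are (timestamp, was_retry, was_error)

    def record(self, was_retry, was_error):
        """Record the outcome of one attempt that was actually sent."""
        self.events.append((self.clock(), was_retry, was_error))

    def _rate(self, index):
        # Drop attempts that have fallen out of the window, then compute
        # the fraction of remaining attempts with the given flag set.
        cutoff = self.clock() - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        if len(self.events) < self.min_samples:
            return 0.0
        return sum(1 for e in self.events if e[index]) / len(self.events)

    def allow_request(self):
        """False once the windowed error rate exceeds max_error_rate."""
        return self._rate(2) <= self.max_error_rate

    def allow_retry(self):
        """False if the breaker is open or retries exceed max_retry_rate."""
        return self.allow_request() and self._rate(1) <= self.max_retry_rate
```

A caller would check `allow_request()` before a first attempt, `allow_retry()` before each retry, and call `record(...)` after every attempt it sends. Note that in this sketch an open breaker only closes again once the failed attempts age out of the window; a real implementation would probably let a few probe requests through sooner.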

The Google SRE book describes lots of other basic techniques to make retries safe.

Finagle fixes this with Retry Budgets: https://finagle.github.io/blog/2016/02/08/retry-budgets/
Totally! Thanks for bringing those up. I tried to keep the scope specifically on retries and client-side mitigation. There's a whole bunch of cool stuff to visualise on the server-side, and I'm hoping to get to it in the future.
Your response makes it sound like you think circuit breakers are server side and not related to retries. They are not; they are a client-side mitigation that is a critical part of a mature retry library.
The client can track its own error rate to the service, but it would need information from a server to get the overall health of the service, which is what the author probably means. Furthermore, the load balancer can add a Retry-After header to have more control over the client's retries.
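To make the Retry-After point concrete, here is a rough sketch of how a client might turn that header into a delay. The function name `retry_after_seconds` and the `now` parameter are made up for the example; the only thing taken from the HTTP spec is that the header value is either a number of seconds or an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Return the delay in seconds a Retry-After value asks for, or None."""
    value = value.strip()
    # Form 1: a non-negative integer number of seconds.
    if value.isascii() and value.isdigit():
        return float(value)
    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable: let the caller fall back to its own backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)  # assumption: treat naive as UTC
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry whenever", so clamp to zero.
    return max(0.0, (when - now).total_seconds())
```

A client would presumably take the larger of this value and its own backoff delay, so the server can slow retries down but a missing or broken header does not speed them up.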
I think I've misunderstood what circuit breakers are for years! I did indeed think they were a server-side mechanism. The original commenter's description of them is great, you can essentially create a heuristic based on the observed behaviour of the server and decide against overwhelming it further if you think it's unhealthy.

TIL! Seems like it can have tricky emergent behaviour. I bet if you implement it wrong you can end up in very weird situations. I should visualise it. :)

I mean, they can and should be both. Local decisions can be cheap, and very simple to implement. But global decisions can be smarter, and more predictable. In my experience, it's incredibly hard to make good decisions in pathological situations locally, as you often don't know you're in a pathological situation with only local data. But local data is often enough to "do less harm" :)
Do you have a newsletter?
Not a newsletter as such but I do have an email list where I post whenever I write something new. You can find it here: https://buttondown.email/samwho