| HN Mirror



	by tyingq 946 days ago
	This is one of those things that sort of exposes our industry's maturity versus other engineering disciplines that have been around longer. You would think by now that the various frameworks for remote calls would have standardized down to include the best practice retry patterns, with standard names, setting ranges, etc. But we mostly still roll our own for most languages/frameworks. And that's full of footguns around DNS caching, when/how to retry on certain failures (unauthorized, for example), and so on. (Yes, there should also be a non-abstracted direct path for cases where you do want to roll your own.)


rewmie 945 days ago

> You would think by now that the various frameworks for remote calls would have standardized down to include the best practice retry patterns, with standard names, setting ranges, etc.

There is a school of thought that argues that the best retry pattern is no retry at all: just let the client fail and handle that state.

One of the driving arguments is that retries are a lazy way to try to move faults from the client onto the server, and in the process cause more harm (i.e., DDoS).

Sometimes complex means wrong, and all these retry strategies are getting progressively more complex at the expense of hammering servers with traffic way beyond the volume they're designed to handle. How is that a decent tradeoff?


pixel8account 945 days ago

I disagree. I think the trade-off is very reasonable. At some point you need to retry (even if the trigger is user manually pressing F5 in the browser/clicking a button again/running a program again). Because they actually have some goal to accomplish.

Some failures really are random, let's say 0.1% of requests fail. For a sufficiently complex backend/operation, one user request can easily generate 100 internal requests that can fail. If you don't retry, this adds up to a non-negligible chance that a whole user-facing operation fails and all 100 requests have to be retried: you actually increased the number of requests that had to be made! As an extreme example, imagine that one request failed while training ChatGPT, and the whole training run had to be restarted from scratch because we don't do retries.
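
The arithmetic behind that claim can be sketched as follows. This is a rough illustration that assumes failures are independent and that a failed operation is simply re-run once from the start; the function names are hypothetical and not from the thread.

```python
def operation_failure_probability(p_fail: float, n_requests: int) -> float:
    """Chance that at least one of n independent internal requests fails."""
    return 1.0 - (1.0 - p_fail) ** n_requests


def failure_probability_with_retries(p_fail: float, n_requests: int, retries: int) -> float:
    """Same, but each internal request may be retried up to `retries` times."""
    # Assumption: a request fails for good only if every attempt fails.
    p_request = p_fail ** (retries + 1)
    return 1.0 - (1.0 - p_request) ** n_requests


if __name__ == "__main__":
    # Figures from the comment: 0.1% failure rate, 100 internal requests.
    p, n = 0.001, 100
    no_retry = operation_failure_probability(p, n)
    one_retry = failure_probability_with_retries(p, n, retries=1)
    print(f"no retries: {no_retry:.2%} of operations fail")
    print(f"one retry:  {one_retry:.4%} of operations fail")
    # First-order estimate of extra requests per operation under each policy.
    print(f"extra requests, redo whole operation: {no_retry * n:.1f}")
    print(f"extra requests, retry per request:    {p * n:.1f}")
```

With these figures, roughly 9.5% of operations fail without retries, versus about 0.01% with a single retry per internal request. Redoing the whole operation costs on the order of 9.5 extra requests per operation, against about 0.1 for per-request retries.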


rewmie 945 days ago

> I disagree. I think the trade-off is very reasonable. At some point you need to retry (even if the trigger is user manually pressing F5 in the browser/clicking a button again/running a program again). Because they actually have some goal to accomplish.

I don't think your belief holds water if you think about your example. The goal of a retry from a client standpoint is to introduce an acceptable delay in order to pretend the original request was successful. This strategy is only valid if the number of retries is enough to not penalize perceived performance or the normal operational state of a service. Consequently, all retry strategies involve sending multiple requests per second. The link to Retry Budgets posted in this discussion explicitly mentions "a minimum of 10 retries per second."

A user pressing F5 will never generate this volume of requests.
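
As an illustration of the retry budget idea mentioned above, here is a minimal sketch. It assumes a budget that permits a fixed minimum number of retries plus a percentage of regular requests within a single time window. The class, its parameters, and the 20% figure are hypothetical and are not taken from the linked document; only the minimum of 10 comes from the quote.

```python
class RetryBudget:
    """Hypothetical per-window budget: retries are capped, not guaranteed."""

    def __init__(self, min_retries: int = 10, percent_can_retry: int = 20) -> None:
        self.min_retries = min_retries              # flat allowance per window
        self.percent_can_retry = percent_can_retry  # assumed value, not from the thread
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        """Count a regular (non-retry) request in the current window."""
        self.requests += 1

    def try_retry(self) -> bool:
        """Return True and spend budget if a retry is allowed, else False."""
        allowed = self.min_retries + (self.requests * self.percent_can_retry) // 100
        if self.retries >= allowed:
            return False  # budget exhausted: the caller should fail fast
        self.retries += 1
        return True

    def reset_window(self) -> None:
        """Start a new window; a real implementation would do this on a timer."""
        self.requests = 0
        self.retries = 0
```

Under these assumptions, a client that sent 100 regular requests in a window could issue at most 30 retries in that window, however many requests failed.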

> Some failures really are random, let's say 0.1% of requests fail.

That's why failing fast and not retry is the best strategy for most if not all applications. Retry strategies introduce high levels of complexity to a task that only rarely happens, and in the rare case that it happens it can be trivially fixed by the user triggering a refresh.

If it's an application that already sends a high volume of requests, then once the first request fails it will simply send another request as part of its happy path.

Some developers like retries because they use them to patch their broken code path and pretend that they do not have to deal with scenarios where a network is not 100% reliable. They onboard a retry library, they update their requests to transparently appear to be a single request, and they proceed as if their application doesn't have a failure mode. Except it does, and now they have also traded their wishful thinking for a higher risk of causing a cascading DDoS attack on their own infrastructure.


tyingq 945 days ago

> That's why failing fast and not retry is the best strategy for most if not all applications.

I think it's more complex than this. You also have to lump timeouts, caching and failure behavior into the conversation. And there are also situations where you absolutely need some amount of retries. Say, for example, you want seamless failover between backends...you're expecting some failures and don't want or need to expose those to your end users. Or, maybe the "end user" isn't a person. Like, for example, finalizing a financial transaction from a queue.
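
The failover case could look something like the sketch below: try each backend once, in order, and surface an error only when every attempted backend has failed. The function and exception names are hypothetical, and treating only `ConnectionError` as a reason to fail over is an assumption.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when every attempted backend failed; carries the individual errors."""


def call_with_failover(backends: Sequence[Callable[[], T]], max_attempts: int = 2) -> T:
    """Call backends in order, at most `max_attempts` of them, one attempt each."""
    errors: List[Exception] = []
    for backend in backends[:max_attempts]:
        try:
            return backend()
        except ConnectionError as exc:
            # Assumption: connection-level failures are safe to fail over on.
            errors.append(exc)
    # Bounded: the caller still sees the failure instead of an endless retry loop.
    raise AllBackendsFailed(errors)
```

Each backend sees at most one extra request per call, so the added load is bounded by `max_attempts`, while the end user only sees an error when all attempted backends are down.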
