by daenney 3618 days ago
I think that what they're trying to get at is that having libraries that (for example) wrap failures in retry modes isn't necessarily failing safely. It can very well obscure problems in your implementation or other parts of the systems you're talking to. Having it fail safely can just as well be "abort execution" and visibly log it so as to raise the problems with those that might be able to solve the root cause.

There's certainly something to be said for retry strategies in places that involve a lot of network chatter, but please also don't forget to add some kind of backoff so you don't end up retry-overloading a system that's trying to recover.
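
To make the backoff point concrete, here is a minimal sketch of retrying with capped exponential backoff in plain Java. It is not taken from any library: `BackoffRetry`, `delayMillis`, and all parameter names are hypothetical, and the choice to rethrow the last failure once attempts run out is an assumption made so the failure stays visible.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of Failsafe or any other library.
final class BackoffRetry {
    // Wait before retry number `attempt` (1-based): base * factor^(attempt - 1), capped at maxMillis.
    static long delayMillis(int attempt, long baseMillis, long maxMillis, double factor) {
        double delay = baseMillis * Math.pow(factor, attempt - 1);
        return (long) Math.min(delay, (double) maxMillis);
    }

    // Runs `op` up to maxAttempts times, sleeping between failed attempts.
    static <T> T call(Callable<T> op, int maxAttempts, long baseMillis, long maxMillis, double factor)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                // Out of attempts (or interrupted): rethrow so the failure is visible to the caller.
                if (attempt >= maxAttempts || e instanceof InterruptedException) {
                    throw e;
                }
                // Back off so a recovering system is not hammered with immediate retries.
                Thread.sleep(delayMillis(attempt, baseMillis, maxMillis, factor));
            }
        }
    }
}
```

With a base of 1 second, a cap of 30 seconds, and a factor of 2, the waits between attempts would be 1s, 2s, 4s, 8s, 16s, and then 30s from there on.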

2 comments

Pretty much this, yes.

If you hit an error condition in your code that you aren't explicitly handling, break that mofo.

The faster and more explicitly you break, the better, as this gives you the signal to fix the problem.

Wrapping and retrying attempts to heal the damage, meaning, effectively, your code is walking wounded -- it's encountered an untrapped error, has ignored it, and is attempting to continue.

The faster and more definitively an error breaks, the better the likelihood of fixing it, and the more obvious the error and fix are.

Author here: fully agree. Blindly retrying any failed operation could lead to cascading failures or system overload, which is what circuit breakers are intended to avoid. Generally, it's just good to think about which failures can or should be retried or recovered from and what recovery should look like. A tool like Failsafe just makes it easier, hopefully, to do what you think is appropriate for the situation.
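
For readers unfamiliar with the pattern, the circuit breaker idea can be sketched roughly as below. This is a deliberately small illustration and not Failsafe's `CircuitBreaker`: the class `SimpleCircuitBreaker`, its threshold and cool-down parameters, and the simplified "half-open" behaviour are all assumptions made for the example.

```java
import java.util.function.Supplier;

// Hypothetical minimal breaker for illustration; real implementations are considerably richer.
final class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures that open the circuit
    private final long openMillis;      // cool-down before calls are let through again
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Closed: always allow. Open: allow again only once the cool-down has elapsed.
    synchronized boolean allowsCall(long nowMillis) {
        return openedAt < 0 || nowMillis - openedAt >= openMillis;
    }

    // A success closes the circuit and resets the failure count.
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    // A failure at or past the threshold (re)opens the circuit and restarts the cool-down.
    synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = nowMillis;
        }
    }

    // Rejects the call outright while the circuit is open instead of piling onto a struggling system.
    <T> T call(Supplier<T> op) {
        if (!allowsCall(System.currentTimeMillis())) {
            throw new IllegalStateException("circuit open, call rejected");
        }
        try {
            T result = op.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure(System.currentTimeMillis());
            throw e;
        }
    }
}
```

The point of the design is that rejected calls fail immediately and loudly, which fits the "break fast" argument made earlier in the thread while still giving the downstream system room to recover.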

Thanks for the reply.

I haven't looked in detail at the library, and probably don't have the chops to identify good or bad features. But the mechanisms described and my understanding of the origins of the concept of "fail safe" seemed at odds, and I wanted to raise the point.

They have a CircuitBreaker. They don't seem to have exponential backoff. But that seems close to correct. For a networked application, what do you think is safer than retrying with exponential backoff and circuit breaking?

Author here. Failsafe does support exponential backoff [1]:

  retryPolicy.withBackoff(1, 30, TimeUnit.SECONDS);

and if you want to specify the exponent [2]:

  retryPolicy.withBackoff(1, 30, TimeUnit.SECONDS, 1.5);

As for which failure handling strategy is safer or what it means to fail safely, in my experience it depends not only on the use case but also on the type of failure. Certain exceptions, even in a networked application, can and should be retried or recovered from while others cannot. Sometimes retrying is good, sometimes preventing subsequent executions (via circuit breakers), sometimes falling back to an alternative resource. It's all based on the scenario.

[1]: http://jodah.net/failsafe/javadoc/net/jodah/failsafe/RetryPo...

[2]: http://jodah.net/failsafe/javadoc/net/jodah/failsafe/RetryPo...
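
As a rough illustration of choosing the strategy per failure type, the sketch below retries only failures judged transient, fails fast on everything else, and falls back to an alternative once retries are exhausted. It is plain Java rather than Failsafe's API; `SelectiveRetry`, the `TRANSIENT` predicate, and the assumption that `IOException` means "transient" are all hypothetical choices for the example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper for illustration only; not part of Failsafe.
final class SelectiveRetry {
    // Assumption for this sketch: I/O errors are transient, anything else is a bug or fatal.
    static final Predicate<Exception> TRANSIENT = e -> e instanceof IOException;

    static <T> T call(Callable<T> primary, Callable<T> fallback,
                      Predicate<Exception> retryable, int maxAttempts) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return primary.call();
            } catch (Exception e) {
                // Non-retryable failure: break immediately so the root cause is visible.
                if (!retryable.test(e)) {
                    throw e;
                }
                // Retryable failure: try again (a real version would back off here).
            }
        }
        // Retries exhausted: fall back to the alternative resource.
        return fallback.call();
    }
}
```

A call such as `SelectiveRetry.call(fetchFromService, readFromCache, SelectiveRetry.TRANSIENT, 3)` (with both callables being placeholders) would retry network hiccups up to three times, use the cache if they persist, and let an unexpected `IllegalArgumentException` surface on the first attempt.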

Cool! Didn't see that on my quick scan. I agree and will probably use this library. It's needed. Thanks.