by danenania 2245 days ago
Do sagas help with rolling back in response to errors? This seems like the nastiest aspect of any distributed transaction approach: step A succeeds, step B succeeds, step C fails, the call to roll back A and B fails... and now what?

Or you do a two-phase commit: A, B, and C tentatively succeed, but then one of the commit calls fails, and now?

It seems like inconsistencies are inevitable no matter what you do.

3 comments

The post you're replying to does list compensating transactions (a form of rollback).

One gotcha that Sagas don't cover (I could be wrong) is when one or more of the network paths involved in the distributed tx become unreachable (a network partition) and you have no idea of the state of that part of the tx. Do you re-try that part and risk sending the same instruction twice (ok in some cases but not all) vs risk of having sent no instruction? If I had to implement a distributed tx, I would first verify my mental model using TLA+, use a (persistent) transactional messaging system with at-least-once delivery as the backbone, and make other accommodations for such scenarios.
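
To make the retry dilemma concrete, here is a minimal sketch of one common accommodation: the sender retries until it sees an ack (at-least-once), and the receiver drops anything it has already applied. The names `DedupingReceiver`, `send_at_least_once`, and `acks_lost` are made up for illustration, and the in-memory set stands in for what would have to be durable storage in a real system.

```python
class DedupingReceiver:
    """Applies each message at most once, keyed by a sender-chosen message ID."""

    def __init__(self):
        self.seen_ids = set()  # assumption: a real system would persist this
        self.balance = 0

    def handle(self, message_id, amount):
        # A redelivered message is acknowledged but not applied a second time.
        if message_id in self.seen_ids:
            return "duplicate"
        self.balance += amount
        self.seen_ids.add(message_id)
        return "applied"


def send_at_least_once(receiver, message_id, amount, acks_lost=0):
    """Resend until an ack arrives; `acks_lost` simulates acks dropped in transit."""
    attempts = 0
    while True:
        attempts += 1
        receiver.handle(message_id, amount)  # the message itself always arrives here
        if attempts > acks_lost:  # the ack finally made it back to the sender
            return attempts
```

The sender cannot tell a lost message from a lost ack, so it resends either way; the message ID is what turns "sent twice" into a harmless no-op on the receiving side.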

> Do you re-try that part and risk sending the same instruction twice (ok in some cases but not all) vs risk of having sent no instruction?

If you can make your compensating action idempotent, then yes, you can just keep retrying it. If it can't be made so for whatever reason, then a failure at that point demands manual intervention.
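
As a rough sketch of what that looks like (not any particular saga library; `run_saga` and its parameters are hypothetical names): run the steps in order, and on failure run the compensations for the completed steps in reverse, retrying each one a bounded number of times. Whatever still fails after the retries is handed back for manual intervention.

```python
def run_saga(steps, max_compensation_retries=3):
    """Run (name, action, compensate) steps in order.

    Returns (status, needs_manual), where needs_manual lists the steps
    whose compensation kept failing after all retries.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            break  # stop at the first failing step and start compensating
        completed.append((name, compensate))
    else:
        return "committed", []  # no step failed

    needs_manual = []
    # Undo the completed steps in reverse order.
    for name, compensate in reversed(completed):
        for _ in range(max_compensation_retries):
            try:
                compensate()  # must be idempotent: it may run more than once
                break
            except Exception:
                continue
        else:
            needs_manual.append(name)  # retries exhausted

    status = "needs_manual_intervention" if needs_manual else "rolled_back"
    return status, needs_manual
```

The retry loop is only safe because the compensation is assumed idempotent. If it isn't, the loop has to go and the first failure lands straight in `needs_manual`.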

I suppose redundant communication channels (that go over different network modalities, e.g., data center native, satellite, 5G) can be used to recover from a network partition. Still, having a protocol with an at-least-once delivery guarantee is important, as it ensures that no messages are lost due to an unexpected crash of the sender/caller or receiver/callee.

> It seems like inconsistencies are inevitable no matter what you do.

Barring guaranteed message delivery (which is effectively non-existent in any distributed system), you always reach a level where you can't guarantee consistency. It's the Byzantine Generals' Problem, basically.

https://en.wikipedia.org/wiki/Byzantine_fault

But based on empirical evidence, you can work out that a certain measure of effort dedicated to fault tolerance will yield correct results in X% of cases, and you can tune the value of X based on how much time/energy/money/effort you're willing to expend... up to a point.
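
A back-of-the-envelope version of that tuning, under the strong (and often wrong) assumption that each attempt fails independently with the same probability: with failure probability p per attempt, n attempts succeed with probability 1 - p^n. The function names below are just for illustration.

```python
def success_probability(p_fail, attempts):
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - p_fail ** attempts


def attempts_needed(p_fail, target):
    """Smallest number of attempts whose success probability reaches `target`."""
    if not (0 <= p_fail < 1) or not (0 <= target < 1):
        # target == 1.0 is never reached when p_fail > 0: that is the "up to a point"
        raise ValueError("need 0 <= p_fail < 1 and 0 <= target < 1")
    attempts = 1
    while success_probability(p_fail, attempts) < target:
        attempts += 1
    return attempts
```

Real failures tend to be correlated (a partition takes out every retry at once), so treat numbers from this as an optimistic bound rather than a prediction.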

Yes, it does... something similar enough.

Accounting has been doing that for centuries already, so it's not new by any means. It's also not free: it imposes severe restrictions on your system's architecture and the kinds of problems it can solve.