Hacker News
by wavemode 625 days ago
I think attempting to automatically "undo" partially failed distributed operations is fraught with danger.

1. It's not really safe, since another system could observe the updated data B (and perhaps act on that information) before you manage to roll it back to A.

2. It's not really reliable, since the sort of failures that prevent you from completing the operation could also prevent you from rolling back parts of it.

The best way to think about it is that if a distributed operation fails, your system is now in an indeterminate state and all bets are off. So if you really must coordinate updates across two or more separate systems, it's best to pick one of two approaches:

a) Design the operation so that nothing really happens until the whole operation is done. One example of this pattern is git (and, by extension, the GitHub API). To commit to a branch, you have to create a new blob, attach that blob to a tree, create a new commit based on that tree, then move the branch tip to point to the new commit. This series of operations is perfectly fine to do in an eventually-consistent manner, since a partial failure just leaves some orphaned objects lying around and doesn't actually affect the branch (updating the branch itself is the last step, and is atomic). You can imagine applying this same sort of pattern to problems like ordering or billing, where the last step is to update the order or update the invoice.

b) The alternative is, as you say, to flag the failure for manual intervention. Most systems in the world operate at a scale where this is perfectly feasible, so sometimes it simply makes the most sense (compared to trying to achieve perfect automated correctness).
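To make option (a) concrete, here is a minimal single-process sketch of the "last step is the only visible step" idea. It is a toy, not how git is actually implemented: `ObjectStore`, `put`, `commit` and the `fail_before_ref_update` flag are all hypothetical names made up for illustration.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store with one mutable pointer per branch (hypothetical)."""

    def __init__(self):
        self.objects = {}            # immutable objects, keyed by content hash
        self.refs = {"main": None}   # the only state that readers follow

    def put(self, kind, payload):
        # Writing an object is harmless on its own: nothing points at it yet.
        key = hashlib.sha1(f"{kind}:{payload}".encode()).hexdigest()
        self.objects[key] = (kind, payload)
        return key

    def commit(self, branch, content, fail_before_ref_update=False):
        blob = self.put("blob", content)
        tree = self.put("tree", blob)
        parent = self.refs[branch]
        new_commit = self.put("commit", f"{tree}:{parent}")
        if fail_before_ref_update:
            # Simulated partial failure: objects were written, branch not moved.
            raise RuntimeError("simulated failure before the final step")
        # Final step: a single pointer update makes everything visible at once.
        self.refs[branch] = new_commit
        return new_commit


store = ObjectStore()
first = store.commit("main", "hello")

try:
    store.commit("main", "world", fail_before_ref_update=True)
except RuntimeError:
    pass

branch_unchanged = store.refs["main"] == first   # the branch never moved
orphan_count = len(store.objects) - 3            # leftovers nobody references
```

The failed attempt leaves three unreferenced objects behind, but anyone reading `main` still sees the first commit. In a real system the final pointer update would need to be atomic in whatever store holds it, which this sketch simply assumes.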

2 comments

Undo may not always be possible or appropriate, but you do need to consider the edge case where an action cannot be applied fully, in which case a decision must be made about what to do. In OP's example, failure to roll back would grant the user points but no discount, which isn't a nice outcome.

Trying to achieve a "commit point" where changes become visible only at the end of all the updates is worth considering, but it's potentially much more complex to achieve. Your entire data model (such as database tables, including indexes) has to be adapted to support the kind of "state swap" you need.
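A rough sketch of what such a "state swap" might look like, to show why the data model has to change: every row carries a version, and readers only follow an active-version pointer. `VersionedTable`, `stage` and `publish` are hypothetical names, and the points/discount values are made up to echo OP's example.

```python
class VersionedTable:
    """Rows are staged under a new version; readers only see the active one."""

    def __init__(self):
        self.rows = {}            # (version, key) -> value
        self.active_version = 0   # the single pointer that readers follow

    def read(self, key):
        return self.rows.get((self.active_version, key))

    def stage(self, version, updates):
        # Copy the active rows forward without clobbering earlier staged writes.
        for (v, k), val in list(self.rows.items()):
            if v == self.active_version:
                self.rows.setdefault((version, k), val)
        # Apply the new values under the not-yet-visible version.
        for k, val in updates.items():
            self.rows[(version, k)] = val

    def publish(self, version):
        # The "commit point": one assignment makes all staged rows visible.
        self.active_version = version


table = VersionedTable()
table.stage(1, {"points": 100, "discount": 0})
table.publish(1)

# Two separate updates that must become visible together.
table.stage(2, {"points": 150})
mid_points = table.read("points")      # readers still see version 1
table.stage(2, {"discount": 10})
table.publish(2)

final_points = table.read("points")
final_discount = table.read("discount")
```

Note how every key, and in a real database every index, now has to include the version. That is the extra complexity being described here.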

>I think attempting to automatically "undo" partially failed distributed operations is fraught with danger.

We once tried it, and decided against it because

1) in practice rollbacks were rarely properly tested and were full of bugs

2) we had a few incidents where a rollback overwrote everything with stale data

Manual intervention is probably the safest way.
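For anyone wondering how a rollback ends up writing stale data, here is a small sketch of the failure mode, plus a guarded variant that refuses to roll back and escalates to a human instead. `Record`, `naive_rollback` and `guarded_rollback` are hypothetical names, and the check-then-write in the guarded version is assumed to be atomic, which a real store would have to provide (for example via compare-and-set).

```python
class Record:
    """A value with a version counter bumped on every write (hypothetical)."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def write(self, value):
        self.value = value
        self.version += 1


def naive_rollback(record, snapshot_value):
    # Blindly restores the snapshot, discarding any writes made since.
    record.write(snapshot_value)


def guarded_rollback(record, snapshot_value, expected_version):
    # Restore only if nobody else wrote after our update; otherwise escalate.
    if record.version != expected_version:
        return "needs_manual_review"
    record.write(snapshot_value)
    return "rolled_back"


# Naive case: a concurrent, legitimate write is lost.
a = Record(100)
snapshot_a = a.value
a.write(80)                      # our partial update
a.write(a.value - 30)            # someone else's unrelated change -> 50
naive_rollback(a, snapshot_a)    # back to 100, the -30 silently disappears

# Guarded case: the same interleaving is detected and left for a human.
b = Record(100)
snapshot_b = b.value
b.write(80)
our_version = b.version
b.write(b.value - 30)
outcome = guarded_rollback(b, snapshot_b, our_version)
```

The guard only stops the overwrite. It does nothing about other systems having already observed the intermediate value, so it ends up at manual intervention anyway.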