I think attempting to automatically "undo" partially failed distributed operations is fraught with danger.

1. It's not really safe, since another system could observe the updated data B (and perhaps act on that information) before you manage to roll it back to A.
2. It's not really reliable, since the sort of failures that prevent you from completing the operation could also prevent you from rolling back parts of it.

The best way to think about it is that if a distributed operation fails, your system is now in an indeterminate state and all bets are off. So if you really must coordinate updates across two or more separate systems, it's best to do one of the following:

a) Design the operation so that nothing really happens until the whole operation is done. One example of this pattern is git (and, by extension, the GitHub API). To commit to a branch, you have to create a new blob, create a tree that references that blob, create a new commit based on that tree, then move the branch tip to point to the new commit. This series of operations is perfectly fine to do in an eventually consistent manner, since a partial failure just leaves some orphaned blobs lying around and doesn't actually affect the branch (updating the branch itself is the last step, and is atomic). You can imagine applying this same sort of pattern to problems like ordering or billing, where the last step is to update the order or the invoice.

b) The alternative is, as you say, to flag the operation for manual intervention. Most systems in the world operate at a scale where this is perfectly feasible, so sometimes it just makes the most sense (compared to trying to achieve perfect automated correctness).
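To make pattern (a) concrete, here is a minimal sketch of the "publish last" idea, loosely modelled on the git steps described above. It is not git's actual implementation: the `Store` class, `put`, `move_ref`, and the `fail_before_publish` flag are all hypothetical names invented for illustration, and everything lives in memory.

```python
import hashlib
import json


class Store:
    """Toy content-addressed store with named refs (hypothetical, in-memory)."""

    def __init__(self):
        self.objects = {}  # object id -> payload; orphans here are harmless
        self.refs = {}     # branch name -> commit id; the only visible state

    def put(self, kind, payload):
        # Writing an object never changes what readers of a branch see.
        data = json.dumps({"kind": kind, "payload": payload}, sort_keys=True)
        oid = hashlib.sha1(data.encode()).hexdigest()
        self.objects[oid] = data
        return oid

    def move_ref(self, name, expected, new):
        # Compare-and-swap: the single step that publishes the change.
        if self.refs.get(name) != expected:
            raise RuntimeError("ref moved concurrently; retry from the new tip")
        self.refs[name] = new


def commit(store, branch, path, content, fail_before_publish=False):
    parent = store.refs.get(branch)
    blob = store.put("blob", content)
    tree = store.put("tree", {path: blob})
    new_commit = store.put("commit", {"tree": tree, "parent": parent})
    if fail_before_publish:
        # Assumption: stands in for a crash or network error mid-operation.
        raise ConnectionError("simulated failure before the ref update")
    store.move_ref(branch, parent, new_commit)
    return new_commit


store = Store()
first = commit(store, "main", "README", "hello")
try:
    commit(store, "main", "README", "goodbye", fail_before_publish=True)
except ConnectionError:
    pass  # nothing to roll back: the branch was never touched

branch_tip_after_failure = store.refs["main"]
orphan_count = len(store.objects) - 3  # objects not reachable from "main"
```

After the simulated failure the branch still points at the first commit, and the only leftovers are unreferenced objects that a later cleanup pass could delete. The same shape would apply to an order or invoice, with "move the ref" replaced by the final status update.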
Aiming for a "commit point", where changes become visible only once all the updates are done, is worth considering, but it can be much more complex to achieve. Your entire data model (database tables, including their indexes) has to be adapted to support the kind of "state swap" you need.
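As a rough illustration of what that adaptation can look like, here is a small sketch using Python's built-in `sqlite3` module. The schema is an assumption made up for this example (the `price` and `active` tables and the `stage`, `publish`, and `visible_prices` helpers are hypothetical): every row carries a version, the version becomes part of the primary key, and readers only ever look at the version named in a one-row pointer table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The version column has to be part of the key (and of any other index).
    CREATE TABLE price (version INTEGER, sku TEXT, amount INTEGER,
                        PRIMARY KEY (version, sku));
    -- One-row pointer table: which version readers are allowed to see.
    CREATE TABLE active (id INTEGER PRIMARY KEY CHECK (id = 1), version INTEGER);
    INSERT INTO active VALUES (1, 0);
""")


def visible_prices(conn):
    # Readers always join through the pointer, never read raw rows.
    rows = conn.execute(
        "SELECT p.sku, p.amount FROM price p "
        "JOIN active a ON a.version = p.version ORDER BY p.sku"
    )
    return dict(rows.fetchall())


def stage(conn, version, prices):
    # Staged rows are durable but invisible until the pointer moves.
    conn.executemany(
        "INSERT INTO price VALUES (?, ?, ?)",
        [(version, sku, amount) for sku, amount in prices.items()],
    )
    conn.commit()


def publish(conn, version):
    # The commit point: a single-row update swaps the visible state.
    conn.execute("UPDATE active SET version = ? WHERE id = 1", (version,))
    conn.commit()


stage(conn, 1, {"apple": 100, "pear": 150})
publish(conn, 1)

stage(conn, 2, {"apple": 120})      # partial staging: "pear" has not arrived
before_swap = visible_prices(conn)  # readers still see all of version 1

stage(conn, 2, {"pear": 170})
publish(conn, 2)
after_swap = visible_prices(conn)
```

Even in this toy case the cost is visible: every query has to go through the pointer, every key grows a version column, and old versions need to be cleaned up at some point.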