by atombender 625 days ago
I find the "forwarder" system here a rather awkward way to bridge the database and Pub/Sub system.

A better way to do this, I think, is to ignore the term "transaction," which is overloaded with too many concepts (such as transactional isolation), and instead to consider the desired behaviour, namely atomicity: You want two updates to happen together, and (1) if one or both fail you want to retry until they are both successful, and (2) if the two updates cannot both be successfully applied within a certain time limit, they should both be undone, or at least flagged for manual intervention.

A solution to both (1) and (2) is to bundle both updates into a single action that you retry. You can execute this with a queue-based system. You don't need an outbox for this, because you don't need to create a "bridge" between the database and the following update. Just use Pub/Sub or whatever to enqueue an "update user and apply discount" action. Using acks and nacks, the Pub/Sub worker system can ensure the action is repeatedly retried until both updates complete as a whole.
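To make the ack/nack idea concrete, here is a minimal sketch using only an in-process Python queue as a stand-in for Pub/Sub. Everything in it (`TransientError`, `run_worker`, `make_handler`, the task fields) is a made-up name for illustration, not the API of any real broker. The point it tries to show is that each step is idempotent, so redelivering the whole action is safe.

```python
import queue


class TransientError(Exception):
    """Hypothetical error type meaning 'nack this task and redeliver it'."""


def run_worker(tasks, handler, max_deliveries=5):
    """Drain the queue, redelivering tasks whose handler raises TransientError.

    Returns the tasks that used up all deliveries, so they can be flagged
    for manual intervention.
    """
    dead_letters = []
    while True:
        try:
            task, deliveries = tasks.get_nowait()
        except queue.Empty:
            return dead_letters
        try:
            handler(task)  # returning normally counts as an ack
        except TransientError:
            if deliveries + 1 < max_deliveries:
                tasks.put((task, deliveries + 1))  # nack: redeliver later
            else:
                dead_letters.append(task)


def make_handler(db, fail_first_n=0):
    """Build a handler for the 'update user and apply discount' action.

    fail_first_n simulates a flaky discount step (assumption for the demo).
    """
    state = {"failures_left": fail_first_n}

    def handler(task):
        # Step 1 is guarded by the task id, so a redelivery never adds points twice.
        if task["id"] not in db["points_applied"]:
            user = task["user"]
            db["points"][user] = db["points"].get(user, 0) + task["points"]
            db["points_applied"].add(task["id"])
        if state["failures_left"] > 0:
            state["failures_left"] -= 1
            raise TransientError("discount service unavailable")
        # Step 2 is a plain overwrite, which is idempotent on its own.
        db["discounts"][task["user"]] = task["discount"]

    return handler
```

A task is enqueued as `(task, 0)`, where the second element counts earlier deliveries. A real broker would track that count for you.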

You can build this from basic components like Redis yourself, or you can use a system meant for this type of execution, such as Temporal.

To achieve (2), you extend the action's execution with knowledge about whether it should retry or undo its work. For such a simple action as described above, "undo" means taking away the discount and removing the user points, which are just the opposite of the normal action. A durable execution system such as Temporal can help you do that, too. You simply decide, on error, whether to return a "please retry" error, or roll back the previous steps and return a "permanent failure, don't retry" error.
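A rough sketch of that retry-or-undo decision, in plain Python rather than Temporal's actual API. The names `RetryLater`, `PermanentFailure` and `run_action` are invented for this example, and it assumes every "do" step is idempotent so that a retry can safely run it again.

```python
class RetryLater(Exception):
    """Hypothetical 'please retry' signal to the queue or workflow engine."""


class PermanentFailure(Exception):
    """Hypothetical 'don't retry' signal, raised after undoing completed steps."""


def run_action(steps, attempt, max_attempts):
    """Run (do, undo) pairs in order.

    On error, ask for a retry while attempts remain. On the last attempt,
    undo the completed steps in reverse order and fail permanently.
    """
    completed = []
    try:
        for do, undo in steps:
            do()
            completed.append(undo)
    except Exception as exc:
        if attempt < max_attempts:
            # Leave completed steps in place: they are idempotent and the
            # next attempt will simply apply them again.
            raise RetryLater(str(exc)) from exc
        for undo in reversed(completed):
            undo()
        raise PermanentFailure(str(exc)) from exc
```

If an undo step itself raises, that exception propagates out of `run_action` instead of `PermanentFailure`, which is the case you would want to flag for manual intervention.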

To tie this together with an HTTP API that pretends to be synchronous, have the API handler enqueue the task, then wait for its completion. The completion can be a separate queue keyed by a unique ID, so each API request filters on just that completion event. If you're using Redis, you could create a separate Pub/Sub per request. With Temporal, it's simpler: The API handler just starts a workflow and asks for its result, which is a poll operation.
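Here is one way the "enqueue, then wait for completion by unique ID" handler could look, sketched with a thread and an in-memory queue instead of Redis or Temporal. `SyncFacade` and its methods are hypothetical names, and the `action` callable stands in for whatever the worker actually does.

```python
import queue
import threading
import uuid
from concurrent.futures import Future


class SyncFacade:
    """Makes a queue-backed action look synchronous to the caller."""

    def __init__(self, action):
        self._action = action
        self._tasks = queue.Queue()
        self._pending = {}  # request id -> Future awaiting completion
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._work, daemon=True)
        self._thread.start()

    def _work(self):
        while True:
            item = self._tasks.get()
            if item is None:  # sentinel pushed by close()
                return
            request_id, payload = item
            try:
                outcome, error = self._action(payload), None
            except Exception as exc:
                outcome, error = None, exc
            # Deliver the completion only to the request that is waiting on this id.
            with self._lock:
                fut = self._pending.pop(request_id, None)
            if fut is not None:
                if error is None:
                    fut.set_result(outcome)
                else:
                    fut.set_exception(error)

    def handle(self, payload, timeout=5.0):
        """What an HTTP handler would call: enqueue, then block on the result."""
        request_id = str(uuid.uuid4())
        fut = Future()
        with self._lock:
            self._pending[request_id] = fut
        self._tasks.put((request_id, payload))
        try:
            return fut.result(timeout=timeout)
        finally:
            # Drop the entry if the caller timed out before completion.
            with self._lock:
                self._pending.pop(request_id, None)

    def close(self):
        self._tasks.put(None)
        self._thread.join()
```

Note that a timeout in `handle` only stops the caller from waiting. The action itself still runs, which is the behaviour you would get from a real queue as well.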

The outbox pattern is better in cases where you simply want to bridge between two data processing systems, but where the consumers aren't known. For example, you want all orders to create a Kafka message. The outbox guarantees that all database changes eventually land in Kafka, but doesn't know anything about what happens next in Kafka land, which could be stuff that is managed by a different team within the same company, or stuff related to a completely different part of the app, like billing or ops telemetry. But if your app already knows specifically what should happen (because it's a single app with a known data model), the outbox pattern is unnecessary, I think.
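For comparison, a bare-bones outbox could look something like this, using SQLite from the standard library. The table layout and the `init`, `create_order` and `relay` functions are assumptions for the sketch, and `publish` is a placeholder for a real Kafka producer call.

```python
import json
import sqlite3


def init(conn):
    with conn:
        conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
        conn.execute(
            "CREATE TABLE outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "payload TEXT NOT NULL, "
            "sent INTEGER NOT NULL DEFAULT 0)"
        )


def create_order(conn, item):
    """Write the order and its outbox row in one local transaction."""
    with conn:  # commits both inserts, or rolls both back on error
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        order_id = cur.lastrowid
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"order_id": order_id, "item": item}),),
        )
    return order_id


def relay(conn, publish):
    """Forward unsent outbox rows in insertion order; returns how many it saw."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays unsent
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

A crash between `publish` and the `UPDATE` would resend the row on the next pass, so this gives at-least-once delivery and consumers need to tolerate duplicates.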

1 comments

I think attempting to automatically "undo" partially failed distributed operations is fraught with danger.

1. It's not really safe, since another system could observe the updated data B (and perhaps act on that information) before you manage to roll it back to A.

2. It's not really reliable, since the sort of failures that prevent you from completing the operation could also prevent you from rolling back parts of it.

The best way to think about it is that if a distributed operation fails, your system is now in an indeterminate state and all bets are off. So if you really must coordinate updates within two or more separate systems, it's best if either

a) The operation is designed so that nothing really happens until the whole operation is done. One example of this pattern is git (and, by extension, the GitHub API). To commit to a branch, you have to create a new blob, associate that blob with a tree, create a new commit based on that tree, then move the branch tip to point to the new commit. As you can see, this series of operations is perfectly fine to do in an eventually-consistent manner, since a partial failure just leaves some orphan blobs lying around, and doesn't actually affect the branch (since updating the branch itself is the last step, and is atomic). You can imagine applying this same sort of pattern to problems like ordering or billing, where the last step is to update the order or update the invoice.

b) The alternative is, as you say, flag for manual intervention. Most systems in the world operate at a scale where this is perfectly feasible, and so sometimes it just makes the most sense (compared to trying to achieve perfect automated correctness).
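To illustrate option (a), here is a toy content-addressed store loosely modelled on the git example. It is not how git is implemented; `Store`, `put`, `commit` and the `fail_after` knob are all invented for the sketch. The only thing it tries to show is that readers follow the ref, and the ref moves in a single final step.

```python
import hashlib


class Store:
    """Toy object store: objects are immutable, refs are the only mutable part."""

    def __init__(self):
        self.objects = {}  # sha256 hex digest -> bytes
        self.refs = {}     # ref name -> digest of the current manifest

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.objects[key] = data  # invisible to readers until a ref points at it
        return key

    def commit(self, ref, parts, fail_after=None):
        """Store all parts, then publish them by moving the ref.

        fail_after simulates a crash before the given part index (demo only).
        """
        keys = []
        for i, part in enumerate(parts):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated failure mid-operation")
            keys.append(self.put(part))
        manifest = self.put("\n".join(keys).encode())
        self.refs[ref] = manifest  # the commit point: one assignment
        return manifest

    def read(self, ref):
        manifest = self.objects[self.refs[ref]].decode()
        return [self.objects[k] for k in manifest.split("\n")]
```

A failure before the last line of `commit` leaves orphan objects behind but never changes what `read` returns.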

Undo may not always be possible or appropriate, but you do need to consider the edge case where an action cannot be applied fully, in which case a decision must be made about what to do. In OP's example, failure to roll back would grant the user points but no discount, which isn't a nice outcome.

Trying to achieve a "commit point" where changes become visible only at the end of all the updates is worth considering, but it's potentially much more complex to achieve. Your entire data model (such as database tables, including indexes) has to be adapted to support the kind of "state swap" you need.

>I think attempting to automatically "undo" partially failed distributed operations is fraught with danger.

We once tried it, and decided against it because

1) in practice rollbacks were rarely properly tested and were full of bugs

2) we had a few incidents when a rollback overwrote everything with stale data

Manual intervention is probably the safest way.