|
|
|
|
|
by fafle
1710 days ago
|
|
The issue of running a transaction that spans multiple heterogeneous systems is usually solved with a 2 phase commit. The "jobs" abstraction from the article looks similar to the "coordinator" in 2PC. The article does not talk about how they achieve fault tolerance in case the "job" crashes inbetween the two transactions. Postgres supports the XA standard, which might help with this. Kafka does not support it. |
|
Commit the checkpoint alongside state mutations in a single store transaction. Only then do you publish ACKs to all of the downstream streams.
Of course, you can fail immediately after commit but before you get around to publishing all of those ACKS. So, on recovery, the first thing a task assignment does is publish (or re-publish) the ACKs encoded in the recovered checkpoint. This will either 1) provide a first notification that a commit occurred, or 2) be an effective no-op because the ACK was already observed, or 3) roll-back pending messages of a partial, failed transaction.
More details: https://gazette.readthedocs.io/en/latest/architecture-exactl...