Hacker News new | ask | show | jobs
by cryptica 2419 days ago
I like the elegance and simplicity of two-phase commits. I didn't understand the criticism in the article; maybe it's something specific to CockroachDB. In my experience with two-phase commits, if the system crashes before a transaction is fully processed and committed, it should be fully reprocessed (from scratch) once the server restarts.

In one of my open source projects with RethinkDB (which doesn't natively support atomic transactions), I implemented a distributed 'parallel' two-phase commit mechanism by assigning each pending record a shard key (an integer derived from a hash of the record id); then worker servers would decide which subset/range of DB records to process/commit based on a hash of their own server id (which would tell them which range of shard keys/records they were responsible for). Only when a record had been fully processed, its status would be updated as committed by writing a 'settled' flag on that record.

Whenever a server failed and restarted, it would pick up processing from the last successful commit. If a server did not restart, the worker count would be updated and remaining workers would redistribute the partitions among themselves based on the shard keys of the records.

4 comments

The "criticism" as it pertains to CockroachDB and off-the-shelf 2PC is less so about 2PC in isolation and more so about layering 2PC on top of consensus groups used to persist records. When any given txn does the "prepare" phase, it lays down markers for a possible upcoming commit. If the 2PC coordinator fails in an inopportune moment, there's a delay between the failure and the markers being cleaned up (whether or not the transaction is "reprocessed" or aborted). The reason why this delay is problematic is because any subsequent transactions that happen upon said markers, they just have to wait for the resolution ("commit"/"aborted") aka it blocks. So clearly recovery must be built into 2PC, i.e. the transaction state itself must be persisted. This is done so in the same way the markers/regular writes are, through consensus. But marking the transaction state as "committed" can only happen once we're guaranteed that all the individual write markers are persisted. Which adds a second round of consensus.
This is about reducing the number of message delays before the commit succeeds. Failure scenarios have to be handled to be correct, but this is a performance optimization primarily from what I can see.
> In my experience with two-phase commits, if the system crashes before a transaction is fully processed and committed, it should be fully reprocessed (from scratch) once the server restarts.

Then the problem becomes, in a fully distributed system how do you know which was the last successful commit before the system crashed?

That 'settled' flag you mention may be set in the coordinator node's storage just before the coordinator crashed, but not communicated to any other node because of the crash.

So the other nodes have to wait for the coordinator to restart, and replay its storage, to find out if that transaction was successful.

That pause can be quite long, especially if it involves a coordinator rebooting, and really long if it involves resynchronising a RAID, checking filesystems/DB integrity, etc.

The other nodes could decide it doesn't matter, exclude the coordinator from the cluster, and reprocess starting from an older transaction. But then they have to agree which older transaction - a distributed consensus problem.

This is certainly possible, but I wouldn't call it simple, and it wouldn't be two-phase commit any more.

Restarting older transactions tends to be visible to clients as well, because they may see transaction retries, and it's often desirable to minimise the number of those from a distributed database for minor failures (such as one node harmlessly going offline). Among other things, they can cause load spikes in the large system, and may exercise retry corner cases that don't come up normally, surfacing bugs that shouldn't be there in application code, but are.

> Whenever a server failed and restarted, it would pick up processing from the last successful commit. If a server did not restart, the worker count would be updated and remaining workers would redistribute the partitions among themselves based on the shard keys of the records.

You don't describe how you come to a global view of which is the "last successful commit". Do you mean that the co-ordinator recovers (such as from its own log) and uses that information, or do you mean you use local information? In the latter case, what you describe doesn't sound like it preserves transaction atomicity across node crashes.

In RethinkDB, a table can be split into shards and each shard has at most one master and multiple replicas. With one master to perform all writes for a shard, there is no need for a co-ordinator. If the master of a shard goes down, a replica can be promoted as the master. For writes, RethinkDB prioritizes consistency over availability so a server failure may carry some downtime in terms of writes but reads can be configured to be highly available (though they could potentially be out of date).

But there is definitely a downside that if a RethinkDB master server fails immediately after a write (before it propagated to a replica) and does not recover, then there can be some data loss (but only the most recent writes on that shard).

So if I'm reading this right, the master process can lose writes (if there's permanent failure). So taking your 2PC scheme above with the 'settled' flag, you can respond to the client that the txn has been committed once you've written the flag, but this commit marker can also just be entirely lost? (again, permanent failure)

If this 'settled' flag exists on each 'shard' instead of just the coordinator, any random subset of those too can be lost? I don't understand what's going on here, or what guarantees this 2PC implementation provides.

Actually, I may be mistaken about my previous commment. I'm not completely sure if this loss of recent data would happen as I've described. It depends on client implementation. For example, a client could wait for a write to propagate to at least 1 replica before telling the caller that the data was inserted successfully. This is an implementation detail I'm not sure about.

Also the settled flag exists on each record, not each shard. A shard is typically made up of multiple unsettled records. Each worker is assigned to a shard using a hash function so it's deterministic and the worker only processes unsettled transactions from their own shard.

Also I said something else misleading in one of my previous comments. In my case, the shard key of each record (which determines which shard a record belongs to) was not based on its own record ID but on the account ID of the user who owns that record. So effectively the sharding was happening based on user accounts and it was designed so that the records created by an account could be processed independently of records created by a different account.