by mrtracy 3945 days ago
Hi there, blog author here!

This blog post is not a full description of CockroachDB's transaction strategy. If you're interested, you can find a complete rundown of our strategy here: (https://github.com/cockroachdb/cockroach/blob/master/docs/de...). However, that is a very terse document; this blog post was an attempt to expand on one of the basic ideas behind our design. My goal was to clearly explain how we approach atomicity, without broaching the complicated topic of isolation.

That said, I will attempt to address some of the pain points you raised:

* Cockroach does use a full MVCC system; writes are "append-only" and we keep the full history of values for each key. Keys can be queried at a timestamp, and every write explicitly specifies a timestamp as well (each transaction is assigned a timestamp, and all of its writes use that same timestamp). With this system, each transaction operates on a snapshot of the database.

* Our strategy for overlapping transactions is straightforward: if two concurrent transactions have a conflict, one of them will be aborted and retried. Write intents (in combination with another component, the "read timestamp cache") are used by conflicting transactions to discover each other. Once discovered, the decision of which in-progress transaction gets aborted is deterministic - each transaction is assigned a random integer "priority", and the transaction with the highest priority always wins the conflict (unless one of the transactions has already committed).

* A read only needs to check the switch if the key being read has a write intent; reads of plain values (with no in-progress transaction) do not need to check a switch.

* Yes, aborted transactions can write intents to the database in our current system. However, the second "write" (consolidation) happens asynchronously, and is not required for the transaction to function correctly. It should only be noticed if the key is accessed again by another transaction before it is consolidated. For committed transactions, I don't believe this is a particularly high price to pay.

A system that doesn't write aborted transactions to disk would be considerably more complex; I think it would involve two commit phases (I. The transaction has been "committed" in memory, and can no longer be aborted by other transactions II. The transaction must successfully commit all changes to disk in an atomic fashion). You likely could get some additional performance (if disk latency is a sufficient bottleneck), but it would be hard won in terms of complexity.
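
To make the points above concrete, here is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not CockroachDB's code: the names `Store`, `Txn`, `Status` and `commit` are hypothetical, timestamps are plain integers, priorities are passed in rather than drawn at random, ties go to the transaction that already holds the intent, consolidation happens lazily on the next access instead of asynchronously, and the read timestamp cache is left out entirely (reads simply ignore pending intents).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Status(Enum):
    PENDING = "pending"
    COMMITTED = "committed"
    ABORTED = "aborted"


@dataclass
class Txn:
    timestamp: int  # every write in this transaction uses this timestamp
    priority: int   # assumption: fixed by the caller so the example is repeatable
    status: Status = Status.PENDING  # the "switch"


def commit(txn: Txn) -> bool:
    """Flip the switch to COMMITTED unless the transaction was already aborted."""
    if txn.status is Status.PENDING:
        txn.status = Status.COMMITTED
    return txn.status is Status.COMMITTED


class Store:
    def __init__(self) -> None:
        # key -> sorted list of (timestamp, value); history is never overwritten
        self.versions: Dict[str, List[Tuple[int, str]]] = {}
        # key -> (owning transaction, provisional value)
        self.intents: Dict[str, Tuple[Txn, str]] = {}

    def _consolidate(self, key: str) -> None:
        """Turn a finished transaction's intent into a plain value, or drop it."""
        intent = self.intents.get(key)
        if intent is None or intent[0].status is Status.PENDING:
            return
        txn, value = self.intents.pop(key)
        if txn.status is Status.COMMITTED:
            history = self.versions.setdefault(key, [])
            history.append((txn.timestamp, value))
            history.sort()

    def write(self, txn: Txn, key: str, value: str) -> bool:
        if txn.status is not Status.PENDING:
            return False
        self._consolidate(key)
        intent = self.intents.get(key)
        if intent is not None and intent[0] is not txn:
            holder = intent[0]
            # Conflict with an in-progress transaction: higher priority wins.
            if holder.priority >= txn.priority:
                txn.status = Status.ABORTED
                return False
            holder.status = Status.ABORTED
            self._consolidate(key)  # discards the aborted intent
        self.intents[key] = (txn, value)
        return True

    def read(self, key: str, timestamp: int) -> Optional[str]:
        # The switch is consulted only when the key carries an intent.
        if key in self.intents:
            self._consolidate(key)
        visible = [v for ts, v in self.versions.get(key, []) if ts <= timestamp]
        return visible[-1] if visible else None


store = Store()
low = Txn(timestamp=20, priority=1)
high = Txn(timestamp=30, priority=9)
store.write(low, "a", "y")
store.write(high, "a", "z")  # conflict on "a": `low` loses and is aborted
commit(high)
print(low.status, store.read("a", 30))
```

Note that the aborted transaction's intent still reached the store before being discarded, which is the cost discussed in the last bullet.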

1 comments

Hi, thanks for the reply!

I will read the more detailed document. It would probably make more sense to me.

From your description of MVCC, it looks like you're storing the whole replicated state machine that you have via the Raft protocol. If txs are assigned such a timestamp, I agree that could operate as a full MVCC system. Good.

Regarding the overlapping txs, it makes sense once you say you're aborting txs. Without that, it sounded really difficult to achieve without locking.

When you say reads only check the switch if there is a write intent, where exactly is that "switch"? I understood it's not with the key, but rather at some other random location. If that's the case, I insist that you require two reads: one for the key, the other one for the switch. Or is the switch co-located with every key?

Regarding the aborted txs, I can't argue without numbers, but I'd like to see those. I mean: you may abort txs voluntarily (I want to ROLLBACK) or involuntarily (there's a concurrent conflict). That may or may not be a high number. But all of those would be written to disk, so it might become a high price.