Hacker News new | ask | show | jobs
by ComodoHacker 2418 days ago
From what I understand, with this new commit protocol they managed to improve response time of a writing transaction by shifting some work of determining its final status to the readers. Am I understanding correctly that readers' performance will degrade by the same amount?

While this is an achievement which some scenarios will definitely benefit from, like bulk loading or updating secondary indexes as mentioned in the article, what about other scenarios where read performance is more important, like the ones where there are much more readers than writers? Shouldn't there be a configuration option which commit protocol to use?

1 comments

> Am I understanding correctly that readers' performance will degrade by the same amount?

Not quite. The "slow path" talked about is only applicable when the coordinator node is unavailable (presumably a rare event). If it's unavailable, there's nobody left to clean up the STAGING txn record, so the reader is tasked to do it itself.

In normal conditions however, once the coordinator node receives acknowledgement for the successful persisting of all write intents and the "txn STAGING" record, it can simply record "txn COMMITTED" in memory (and return to the client, send off async intent resolution procedures, etc.) Any subsequent read requests that observes left over intents (yet to be resolved) are pointed to the coordinator node, which can simply consult the "txn COMMITTED" record in memory. This is all safe because the commit marker is not simply stored on the coordinator node, it's a distributed condition and can be reconstructed by any observer even if the coordinator failed.

>Any subsequent read requests that observes left over intents (yet to be resolved) are pointed to the coordinator node, which can simply consult the "txn COMMITTED" record in memory.

Another roundtrip performed by reader rather than writer? That's what I'm talking about.

Though I understand it differently, as reader just waits until all writes and "txn COMMITTED" record arrive at it's node.

> Though I understand it differently, as reader just waits until all writes and "txn COMMITTED" record arrive at it's node.

Actually, welp, I believe this is closer to the truth in implementation. But consider how read performance compares before and after introduction of Parallel Commits. Before, when readers happen over extant 2PC "prepare" phase markers, they would still have to wait for txn resolution (on the coordinator node of the other txn, or on the node the intent was seen). They simply continue doing the same in Parallel Commits, there's no extra latency added to the read path (except when there is failure, but even then as soon as the earlier txn is recovered, future readers no longer get stuck).

>Before, when readers happen over extant 2PC "prepare" phase markers, they would still have to wait for txn resolution

Again, I understand it differently, but maybe I'm wrong. A reader upon encountering unresolved intent looks up corresponding transaction record. Before Parallel Commits, it's either marked COMMITTED or PENDING. If it's PENDING, reader just ignores it and skips its data, since they use MVCC. There's no waiting here.

Now with parallel commits, transaction record can also be marked STAGING, in which case the reader cannot determine if it's commited without additional work and/or waiting (the author doesn't go much into details).

> If it's PENDING, reader just ignores it and skips its data, since they use MVCC. There's no waiting here.

I think this is where the confusion is coming from. You're correct that a read can simply ignore writes, even pending ones, at higher timestamps due to MVCC. This improves transaction concurrency.

However, if a read finds a provisional write (an intent) at a lower timestamp, it can't just ignore it. It needs to know whether to observe the write or not. So it looks up the write's transaction record and may have to wait. If the write transaction is not finalized then it needs to either wait on the transaction to finish or force the transaction's timestamp up above its read timestamp. This is true regardless of parallel commits or not.

What parallel commits gets us is a faster path to transaction commit, as irfansharif pointed out below. So the write can not only be committed faster with parallel commits, but it can also be resolved faster to get out of other reads' ways. In that way, it improves both the synchronous latency profile and the contention footprint of transactions, assuming no coordinator failures.

Thank you for clarification.
If I understood correctly, the extra round trip on the reader only occurs with left-over intents, which are the product of an earlier failure.

So:

- Writes are faster due to fewer round trips in normal running.

- Reads are the same speed in normal running.

- After coordinator failure events, the new coordinator starts to clean up left-over intents asynchronously (nothing specific is waiting for it to finish).

- Reads are slowed by extra round trips in the short time after a failure event, until the left-over intents are cleaned up. But only in that time period, and only for those ranges touched by transactions during the failure event.

- The cost of extra round trips done by the reader just during recovery is much less important than the round trips that happened on every cross-region write with the older algorithm.

- But if you care about read latency being consistent all the time, including during recovery from coordinator failure events, maybe you need a more sophisticated high-availability configuration for the logical coordinator.

> If I understood correctly, the extra round trip on the reader only occurs with left-over intents, which are the product of an earlier failure.

Left over intents are also visible for an ongoing txn, before those intents are resolved. But there's no added latency in the read path, I've commented elsewhere in the thread to explain how.

> After coordinator failure events, the new coordinator starts to clean up left-over intents asynchronously (nothing specific is waiting for it to finish).

The cleanup happens by readers on demand, there's no separate global coordinator scanning the keyspace and resolving old write intents.

> Left over intents are also visible for an ongoing txn, before those intents are resolved. But there's no added latency in the read path, I've commented elsewhere in the thread to explain how.

Thanks, that was a helpful comment.

> The cleanup happens by readers on demand, there's no separate global coordinator scanning the keyspace and resolving old write intents.

Oh, that's a little surprising. I assumed the coordinator(s) did so because asynchronous cleanup is mentioned numerous times in the article, but upon closer scrutiny I see now that it only applies in the after phase of transactions without a failure.

Would that scanning, analogous to RAID "resilvering" subject to write-intent ranges to limit the keyspace regions scanned, usefully improve read latencies later?

> I see now that it only applies in the after phase of transactions without a failure.

Yep.

> Would that scanning, analogous to RAID "resilvering" subject to write-intent ranges to limit the keyspace regions scanned, usefully improve read latencies later?

I think it's just a better design to have it done on demand. The keyspace is large and failures are rare, and when one of these zombie intents are happened upon, the very first reader addressing it resolves it for all subsequent readers. A global scan would improve read latencies later, but not by much and not for many readers.