by matthewaveryusa
3811 days ago
Great to see fresh ideas! One thing I don't like in the presentation is that TAPIR is presented as doing better than what is out there without stating the conditions. First off, there's quite a bit of hand-waving when it comes to leadership bottlenecks; please assume that sane sharding is occurring. I'm not entirely sure transactions spanning partitions are something unique to, say, Paxos. The abort rate vs. contention numbers seem off. Looking at the paper, all the nodes are in a single datacenter. I would love to see these numbers with TAPIR spread across larger geographical regions. My suspicion is that higher latency will hurt the abort rate more than it would with a strong leader. What about poor clock synchronization conditions? What about testing with a variety of client latencies? Since the client is effectively acting as a leader in TAPIR, the client is, in some ways, contending with other clients, and the abort rate may be correlated with client latency. I don't think high-latency clients would see the same correlation with a strong leader. I wish more of the compromises were presented.
The paper has an evaluation of multi-data-center replication in Figure 12. We assume that the clients are web servers, so they are always close to one of the replicas, but not all of them. The result we found is basically that TAPIR performs better in the multi-data-center case except when the leader is in the same data center as the client. So it depends on whether you can always guarantee that the leader is in the same data center as the client.
The abort rate continues to essentially track the latency needed for commit. So, TAPIR reduces the abort rate compared to OCC because it reduces the commit latency. At very high contention, locking is likely to make slightly more progress, but no system with strong consistency will be able to provide high performance there. If you are interested in some other ways to optimize for the high contention case, take a look at our work on Claret: http://homes.cs.washington.edu/~bholt/projects/claret.html
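To make "the abort rate tracks commit latency" concrete, here is a toy model that is not taken from the paper. It assumes conflicting writes to a transaction's keys arrive as a Poisson process, and that the transaction aborts if at least one arrives while it is waiting to commit. The function name, the arrival rate, and the latencies are all hypothetical.

```python
import math

def abort_probability(conflict_rate_per_s: float, commit_latency_s: float) -> float:
    """Toy model (assumption): P(at least one conflicting write during the commit window)."""
    return 1.0 - math.exp(-conflict_rate_per_s * commit_latency_s)

# Hypothetical numbers: 5 conflicting writes per second on the touched keys,
# committing in one 100 ms round trip versus two.
one_round_trip = abort_probability(5.0, 0.100)
two_round_trips = abort_probability(5.0, 0.200)
print(f"one round trip:  {one_round_trip:.3f}")
print(f"two round trips: {two_round_trips:.3f}")
```

Under these assumptions, anything that shortens the commit window lowers the abort probability, which is the shape of the argument above; the real curves depend on the workload and the validation rules.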
We also tested with high clock skew. The paper notes, "with a clock skew of 50 ms, we saw less than 1% TAPIR retries." Since the clients can use the retry timestamps to sync their clocks, a retry only adds one extra round trip, which still leaves TAPIR with the same latency as a conventional system, even in cases of extremely high clock skew.
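A minimal sketch of the retry-timestamp idea, under heavy simplification: a client whose clock runs behind proposes a timestamp that is too low, the replica answers with a suggested timestamp, and the client both retries and remembers the correction. The names `Client`, `prepare`, and `commit`, and the single `min_acceptable_ts_ms` threshold, are hypothetical stand-ins; this does not model TAPIR's actual validation checks.

```python
from dataclasses import dataclass

@dataclass
class Client:
    local_clock_ms: int   # possibly skewed wall clock (assumption: fixed for the demo)
    offset_ms: int = 0    # correction learned from retry timestamps

    def propose_timestamp(self) -> int:
        return self.local_clock_ms + self.offset_ms

    def on_retry(self, retry_ts_ms: int) -> None:
        # Move the correction forward so later proposals are not too low again.
        self.offset_ms = max(self.offset_ms, retry_ts_ms - self.local_clock_ms)

def prepare(proposed_ts_ms: int, min_acceptable_ts_ms: int):
    """Hypothetical replica check: accept, or ask for a retry at a suggested timestamp."""
    if proposed_ts_ms >= min_acceptable_ts_ms:
        return ("ok", proposed_ts_ms)
    return ("retry", min_acceptable_ts_ms)

def commit(client: Client, min_acceptable_ts_ms: int):
    """Return (status, timestamp, round_trips); at most one retry in this sketch."""
    round_trips = 1
    status, ts = prepare(client.propose_timestamp(), min_acceptable_ts_ms)
    if status == "retry":
        client.on_retry(ts)
        round_trips += 1
        status, ts = prepare(client.propose_timestamp(), min_acceptable_ts_ms)
    return status, ts, round_trips

# A client whose clock is 50 ms behind what the replica will accept.
skewed = Client(local_clock_ms=1000)
first = commit(skewed, min_acceptable_ts_ms=1050)
second = commit(skewed, min_acceptable_ts_ms=1050)
```

In this sketch the first commit costs two round trips and the second costs one, because the client keeps the learned offset.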