Excellent as usual. One question though. When refuting the rebuttal's claim:
>Moreover, if we test the same workload on a simple single instance InnoDB, we will get the same result.
He simply states that he tested it and it didn't happen, so it must be OK. But later on says the x=x+y form perhaps only accidentally works as "I suspect that this only passes because the window of concurrency for the read/write cycle is very short".
Why wouldn't this be true of the single-node scenario? Certainly things are faster on a single node, so unless Jepsen is pre-empting and delaying each thread-instruction combination, a test passing certainly isn't proof of it actually being OK, right? It's possible that the same workload on a heavily overloaded instance, might, somehow, exhibit the same behaviour, eh? I understand that the next sentence goes on to attribute it to a likely problems with Galera's locks. But the test alone shouldn't be conclusive evidence. Not trying to be pedantic or discuss what "proof" means just curious as to if I missed something.
The #1 issue is that the "2 out of 3" formulation is utterly misleading, as it sounds as if all the three options are feature of the distributed system. In reality, people don't control "P" - it's a feature of the network. You may be very careful about designing and operating your network, but one day a partition will happen. So "CA" systems are nonsense.
The other problem is that while CAP uses the same basic terms as ACID, the meaning of those terms is entirely different. This makes discussions about the CAP vs. ACID stuff pretty much impossible, because each group thinks Consistency or Availability means the same thing in both worlds. So RDBMS people will shout "We have availability, just like you!" and NoSQL people will shout "We have consistency, just like you!" Madness ...
Good point, but I wasn't implying that the "A" in ACID stands for Availability (sorry if that was unclear), but that what people dealing with traditional systems (build on ACID) mean when they speak about "Availability".
For example it may mean a master-slave system with an automatic failover, which is rather different from what Availability means in CAP.
I had high hopes in Percona guys with the terrific job they are doing in the MySql space. In our experience, Percona server is working great in production and percona toolkit is a life saver for us.
My very personal takeaway from all these aphyr tests: Distributed systems are hard. Out of the box, very few work properly and probably none of them will work as expected because of a faulty deployment. Think twice before deploying one and consider other ways to scale your system.
In our case, we are going with the memcached approach: client-side sharding with some kind of replication in the backend (mysql master-slave, drdb, etc). Far easier to deploy and, what is more important, far easier to recover when something goes wrong.
Discussion of Aphyr's MariaDB Galera Cluster post: https://news.ycombinator.com/item?id=10171942
Discussion of Percona’s response to it: https://news.ycombinator.com/item?id=10238690