| > I can replicate this failure in 2% to 5% of writes. Yeah, I'm curious about how you achieved those numbers. Your test that gets that 2-5% of writes (though your docs say 7.5%) to be messed up... what is really is measuring is the probability that out of 5 concurrent clients writing to 4 servers, at least two will finish writing to a row with the exact same timestamp... AND that they will be the LAST ones to write to that row. If just one of those clients ends up just a hair behind the other four, then you should register 0 collisions. What is even weirder is your benchmark takes 100 seconds to complete what amounts to 5000 writes, or averaging a rate of 50 writes per second, 10 writes per client per second. Those are pathetic numbers for a one node Cassandra cluster, let alone a four node one. WTF is going on here? Even more confusing, you are writing with ANY consistency, which means that in many cases those writes will be stamped and committed on different nodes, yet somehow getting the same timestamp. Odds on this seem... highly suspect. It almost seems like your clock only has 1 second resolution, which is weird. Have you checked the writetime timestamps on your records? I've done writes at much higher rates where we recorded the timestamps of every single write operation. We've yet to get the same timestamp on two operations. I also see Cassandra timeouts while writing with consistency ANY, yet are still somehow getting timeouts with this operation. That really screams to me that the cluster is truly messed up. Now, as you say, if you control the timestamps, you get collisions 99.9% of the time. I don't even get why it isn't just straight up 100% for that case. > Given that the whole point of isolation is to provide invariants during concurrent modification I think it is fair to say that you don't have transaction isolation if the timestamps are exactly the same. That is just an exceedingly low probability event unless you have a LOT of transactions per second. I'd dump the "writetime(a), writetime(b)" values to get an idea of what is going on there.... something smells and there is a lot less cardinality in those timestamps than I'd expect. |
Jepsen's code is open source (http://github.com/aphyr/jepsen) and I've written extensively about the techniques involved; see http://aphyr.com/tags/jepsen for details. Cassandra work is upcoming; no formal writeup yet.
Your test that gets that 2-5% of writes (though your docs say 7.5%) to be messed up...
Sorry, 2-5% was my mistake. Been playing around with the parameter space; numbers vary a bit.
what is really is measuring is the probability that out of 5 concurrent clients writing to 4 servers, at least two will finish writing to a row with the exact same timestamp... AND that they will be the LAST ones to write to that row. If just one of those clients ends up just a hair behind the other four, then you should register 0 collisions.
In a Dynamo system, a.) there is no such thing as "time", b.) there is no such thing as "last", and c.) causality tracking beats everything. Doesn't matter what order you do the writes in; timestamps (and/or vclocks in Voldemort/Riak) take precedence.
What is even weirder is your benchmark takes 100 seconds to complete what amounts to 5000 writes, or averaging a rate of 50 writes per second, 10 writes per client per second. Those are pathetic numbers for a one node Cassandra cluster, let alone a four node one. WTF is going on here?
Each client in the Jepsen test harness is (independently) scheduling n writes per second. Jepsen schedules its writes this way to a.) avoid measuring an overloaded system, b.) produce results which are somewhat comparable between runs, and c.) measure results over changing underlying dynamics--in this case, a network partition.
Even more confusing, you are writing with ANY consistency, which means that in many cases those writes will be stamped and committed on different nodes, yet somehow getting the same timestamp. Odds on this seem... highly suspect. It almost seems like your clock only has 1 second resolution, which is weird.
There's an interesting probability anecdote called the Birthday Paradox, which says that if you get 30 people in a room, chances are good that 2 will share the same birthday. At ten uniformly distributed writes a second, the probability of a timestamp collision is 0.44%... in any given second. Chances of a collision after a thousand seconds of runtime are 99.9999%. If you push 100 writes per second, collision probability is 50% in any second. If you push only 2 writes every second, you should expect to see a collision once every few days. How long-lived is that collision? It depends on the distribution of writes over time, and on the network, but you can work out a mean free path.
TL;DR: microsecond timestamps do not provide sufficient entropy for uniqueness constraints over common workloads.
I also see Cassandra timeouts while writing with consistency ANY, yet are still somehow getting timeouts with this operation. That really screams to me that the cluster is truly messed up.
The timeouts in this case are, I think, a Cassandra bug (or expected behavior) when partitions occur. Last I heard from jbellis, it wasn't clear what Cassandra should do under these conditions, but I think he was leaning towards allowing the local hint to count as success always.
Now, as you say, if you control the timestamps, you get collisions 99.9% of the time. I don't even get why it isn't just straight up 100% for that case.
The reason not all writes result in conflict with identical timestamps is, I suspect, due to that transitional period during the beginning of the network partition.