| Hi. I'm the author of GoshawkDB. Thanks for your questions. > It isn't true. There're plenty of storages with same strong consistency model (tolerates F failures of 2F+1 nodes), among them are Cassandra with lightweight transactions, Riak with consistent buckets, CockroachDB and ZooKeeper. The point I was trying to make is the decoupling between cluster size and F. ZK, for example, requires that the cluster size is exactly 2F+1, not is minimum 2F+1. That means that as the cluster size increases, performance will decrease as more nodes have to be contacted. > Can you please describe the algorithm that you use for implementing distributed transactions? Yes. In detail, there will be a blog post forthcoming. However the two main points is that 2PC is replaced by Gray and Lamport's paper "Consensus on transaction Commit", and dependencies between transactions are modelled through vector clocks. As I'm visiting lots of family over xmas, I'm afraid I don't have the 12+ hours necessary to document the entire algorithm just at the moment. However, I will certainly get to this asap. > Do you use the same approach? I don't know. I don't use the term round-trip because that doesn't make it clear if we're talking individual messages or message-hops, and these sorts of conversation and discussion can often go wrong if we use different terminology. The 3 network-delays that I mention comes directly from the Gray and Lamport paper I mention above. |
How does this work? Suppose we set F=2 and have 7 nodes, A-G. Client 1 tries to write, contacting 5 nodes, A-E. A and B are down at this point, but C, D and E accept the write so it calls it good. Then A and B come up and C and D go down. Client 2 tries to read, contacting nodes A, B, C, F, G. 4 out of the 5 respond successfully, so it calls it good. But it can't possibly read the value that client 1 wrote.
Whatever way you do it, the number of nodes contacted for a write + number of nodes contacted for a read has to add up to more than the cluster size. If your cluster is somehow sharded then you can work around this with e.g. consistent hashing (so that client 2 knows which 5 of the 7 nodes a particular key must be stored on, and contacts the correct subset), but that only works if client 2 can know what the key was - and I think Cassandra already supports that.