| > The only requirement is that if F is the number of unreachable nodes you wish to be able to tolerate, then your minimum cluster size is 2*F + 1.
> Currently no other data store that I'm aware of offers this flexibility. It isn't true. There're plenty of storages with same strong consistency model (tolerates F failures of 2F+1 nodes), among them are Cassandra with lightweight transactions, Riak with consistent buckets, CockroachDB and ZooKeeper. > However, GoshawkDB has a radically different architecture to most databases and its object-store design means that contention on objects can be dramatically reduced, allowing the performance impact of strong-serializability to be minimised. Can you please describe the algorithm that you use for implementing distributed transactions? > just three network delays occur (and one fsync) before the outcome of the transaction is known to that node 3 round trips for per transaction is the result which is very similar to the Yabandeh-style transaction, see the Cockroach's blog for details - http://www.cockroachlabs.com/blog/how-cockroachdb-distribute... Do you use the same approach? |
> It isn't true. There're plenty of storages with same strong consistency model (tolerates F failures of 2F+1 nodes), among them are Cassandra with lightweight transactions, Riak with consistent buckets, CockroachDB and ZooKeeper.
The point I was trying to make is the decoupling between cluster size and F. ZK, for example, requires that the cluster size is exactly 2F+1, not is minimum 2F+1. That means that as the cluster size increases, performance will decrease as more nodes have to be contacted.
> Can you please describe the algorithm that you use for implementing distributed transactions?
Yes. In detail, there will be a blog post forthcoming. However the two main points is that 2PC is replaced by Gray and Lamport's paper "Consensus on transaction Commit", and dependencies between transactions are modelled through vector clocks. As I'm visiting lots of family over xmas, I'm afraid I don't have the 12+ hours necessary to document the entire algorithm just at the moment. However, I will certainly get to this asap.
> Do you use the same approach?
I don't know. I don't use the term round-trip because that doesn't make it clear if we're talking individual messages or message-hops, and these sorts of conversation and discussion can often go wrong if we use different terminology. The 3 network-delays that I mention comes directly from the Gray and Lamport paper I mention above.