Hacker News new | ask | show | jobs
by rystsov 3834 days ago
> The only requirement is that if F is the number of unreachable nodes you wish to be able to tolerate, then your minimum cluster size is 2*F + 1. > Currently no other data store that I'm aware of offers this flexibility.

It isn't true. There're plenty of storages with same strong consistency model (tolerates F failures of 2F+1 nodes), among them are Cassandra with lightweight transactions, Riak with consistent buckets, CockroachDB and ZooKeeper.

> However, GoshawkDB has a radically different architecture to most databases and its object-store design means that contention on objects can be dramatically reduced, allowing the performance impact of strong-serializability to be minimised.

Can you please describe the algorithm that you use for implementing distributed transactions?

> just three network delays occur (and one fsync) before the outcome of the transaction is known to that node

3 round trips for per transaction is the result which is very similar to the Yabandeh-style transaction, see the Cockroach's blog for details - http://www.cockroachlabs.com/blog/how-cockroachdb-distribute...

Do you use the same approach?

1 comments

Hi. I'm the author of GoshawkDB. Thanks for your questions.

> It isn't true. There're plenty of storages with same strong consistency model (tolerates F failures of 2F+1 nodes), among them are Cassandra with lightweight transactions, Riak with consistent buckets, CockroachDB and ZooKeeper.

The point I was trying to make is the decoupling between cluster size and F. ZK, for example, requires that the cluster size is exactly 2F+1, not is minimum 2F+1. That means that as the cluster size increases, performance will decrease as more nodes have to be contacted.

> Can you please describe the algorithm that you use for implementing distributed transactions?

Yes. In detail, there will be a blog post forthcoming. However the two main points is that 2PC is replaced by Gray and Lamport's paper "Consensus on transaction Commit", and dependencies between transactions are modelled through vector clocks. As I'm visiting lots of family over xmas, I'm afraid I don't have the 12+ hours necessary to document the entire algorithm just at the moment. However, I will certainly get to this asap.

> Do you use the same approach?

I don't know. I don't use the term round-trip because that doesn't make it clear if we're talking individual messages or message-hops, and these sorts of conversation and discussion can often go wrong if we use different terminology. The 3 network-delays that I mention comes directly from the Gray and Lamport paper I mention above.

> The point I was trying to make is the decoupling between cluster size and F. ZK, for example, requires that the cluster size is exactly 2F+1, not is minimum 2F+1. That means that as the cluster size increases, performance will decrease as more nodes have to be contacted.

How does this work? Suppose we set F=2 and have 7 nodes, A-G. Client 1 tries to write, contacting 5 nodes, A-E. A and B are down at this point, but C, D and E accept the write so it calls it good. Then A and B come up and C and D go down. Client 2 tries to read, contacting nodes A, B, C, F, G. 4 out of the 5 respond successfully, so it calls it good. But it can't possibly read the value that client 1 wrote.

Whatever way you do it, the number of nodes contacted for a write + number of nodes contacted for a read has to add up to more than the cluster size. If your cluster is somehow sharded then you can work around this with e.g. consistent hashing (so that client 2 knows which 5 of the 7 nodes a particular key must be stored on, and contacts the correct subset), but that only works if client 2 can know what the key was - and I think Cassandra already supports that.

Ok, so F=2, so 2F+1 is 5. So client 1 creates a txn that writes to object x. x has replicas on A-E, but not F and G. So client 1's txn gets sent to F+1 drawn from A-E. In your scenario A and B are down so that only leaves C, D and E. So they vote on the txn and vote to Commit. This vote forms a Paxos round as per Gray and Lamport's paper "Consensus on transaction commit". Assuming that all these votes get received and written to disk by at least F+1 acceptors (which will almost certainly also be C, D, and E), and that C, D and E all form the exact same overall outcome then consensus has been achieved: the outcome is fixed and stable.

A and B come up at the same time that C and D go down. C, D and E all knew (in their role as acceptors, not voters) that the result of that txn has to go to _all_ of A-E. So when A and B come up, E will see this and send them the txn outcome. Now A and B can't progress at this point yet because they've only seen the outcome from one acceptor (E), not the minimum F+1. So as they spot C and D are down, they will start additional paxos rounds to try and get the txn aborted. Those rounds will _have_ to contact E, and they will find that actually the txn votes led to a commit. Now A and B will also play the role of acceptor, taking over from the failed C and D. Thus we'll finally have A B and E as acceptors all with the original commit votes, they will form the same outcome, and all the voters and learners will get the same result. I realise this is probably not the cleanest explanation, but this is the chief purpose of consensus algorithms in general: once a majority (F+1) of nodes form consensus about something, that consensus can never change, and can only be propagated further.

> Client 2 tries to read, contacting nodes A, B, C, F, G. 4 out of the 5 respond successfully, so it calls it good. But it can't possibly read the value that client 1 wrote.

Client 2 is trying to read object x and object x only exists on A-E, not F and G. So client 2 cannot ask F and G to take part. I assume you mistyped and meant A, B, and E, not C as I thought C and D were down at this point.

Client 2 will get A B and E to vote on this next transaction and that will occur as normal as we have only have 2 failures currently (C and D).

> Whatever way you do it, the number of nodes contacted for a write + number of nodes contacted for a read has to add up to more than the cluster size.

No, that's not true. It has to add up to more than the number of replicas of an object. That can be much less than the cluster size in GoshawkDB. This property is not true of several other data stores.

> If your cluster is somehow sharded then you can work around this with e.g. consistent hashing (so that client 2 knows which 5 of the 7 nodes a particular key must be stored on, and contacts the correct subset), but that only works if client 2 can know what the key was

Yes, this is exactly what GoshawkDB does (though my consistent hashing algorithms are quite unlike the norm).

> and I think Cassandra already supports that

Maybe - I'm certainly not a cassandra expert. I'm pretty sure cassandra does not support full transactions. For example: https://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml...

"Cassandra supports atomicity and isolation at the row-level, but trades transactional isolation and atomicity for high availability and fast write performance".

Which is fair enough - there's definitely a use case for the properties that Cassandra offers - as I've written on the GoshawkDB website, GoshawkDB is certainly not a competitor of Cassandra. GoshawkDB does support full transactions.

But isn't it the restriction to single-row transactions that makes it possible to isolate a given transaction to a particular subset of the cluster? If you allow a transaction across any set of rows, how can you possibly do consistent hashing of that, when the next transacting might involve a different (but maybe overlapping) set of rows?
Well I have no idea of how Cassandra works, so I can't comment on that.

GoshawkDB gets every object touched in the transaction to vote on the outcome of the transaction, in parallel, and guarantees that if two transactions touch the same object, there must be at least one copy of that object that gets to vote on both transactions, thus can enforce any dependencies as necessary.

Edit: further clarity: the use of consistent hashing in GoshawkDB is to ensure that each node has an equal number of object copies. It is the Paxos Synod algorithm that ensures a majority of those copies of any object get to vote on every transaction that touches that object, thus ensures that for any two transactions touching the same object, there is at least one copy of the object that gets to vote on both.

Ok. So in practice a multi-row transaction will have to contact >2F+1 nodes (2F+1 for each object involved, likely overlapping but not completely)?