Hacker News
by cx01 5913 days ago
Paxos (which you'll probably use to implement a strongly consistent distributed system) allows nodes to reach consensus as long as a majority of them are able to talk to each other. So if a partition includes a majority of nodes, this partition will just continue working like before. The minority partition will not be able to reach any decisions at all and will be unavailable. Strictly speaking, the minority partition doesn't even know it's a minority partition, because this is an undecidable problem in asynchronous distributed systems. But the important point is: At most one partition will continue making decisions so that consistency is guaranteed.

So you're right in that the minority partition will be unavailable, but I think it's a worthwhile tradeoff.
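
To make the majority rule concrete, here is a minimal sketch in Python. It assumes a global view of the partition sizes, which no real node has (as noted above, a node can't know it's in the minority), and it is not an implementation of Paxos. The names `has_quorum` and `live_partitions` are made up for illustration.

```python
def has_quorum(partition_size: int, cluster_size: int) -> bool:
    """True if this side of a partition holds a strict majority of all nodes."""
    return partition_size > cluster_size // 2


def live_partitions(partition_sizes: list[int]) -> list[int]:
    """Return the indices of the partitions that can still reach decisions.

    Two disjoint partitions can never both hold a strict majority, so the
    result has at most one entry.
    """
    cluster_size = sum(partition_sizes)
    return [
        index
        for index, size in enumerate(partition_sizes)
        if has_quorum(size, cluster_size)
    ]
```

With five nodes split 3/2, only the side with three keeps working. With an even split (2/2) or three isolated datacenters of equal size, nobody has a majority and the whole system stops making decisions.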

It _may_ be a worthwhile tradeoff. But if you're in three datacenters and one of them splits you could have a third of your requests fail. Depending on what you're doing, that may be unacceptable. I agree that it's a tradeoff. I disagree with the people who are saying that emphasizing consistency is always the right thing to do.
The problem is: If a node performs a write in an eventually-consistent data-store, the write will not be visible immediately. Alternatively (in a strongly consistent data-store) the node could just retry the write until it succeeds. From the perspective of a user that wouldn't really make much of a difference, but in the latter case there would at least be no risk of having multiple inconsistent versions of the same data.
Your first point is not true. In the absence of partitions, all "eventually consistent" data stores that I know of will give you strong consistency. The eventual consistency bit only comes into play if a problem occurs. (It's probably also worth noting that many RDBMS replication strategies won't give you strong consistency at all - even in ideal circumstances.)

I suppose you could retry if a write fails (e.g., can't reach a quorum), but you could theoretically end up retrying forever (and 10 seconds or so is forever as far as an interactive user is concerned)... eventually you need to either fail or write inconsistently. So we're just delaying the inevitable.
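
Roughly what I mean by retrying until you have to give up, as a Python sketch. Everything here is hypothetical: `write` stands in for whatever client call you'd make, `QuorumUnavailable` is an invented exception, and the 10 second default is just the "forever for an interactive user" figure from above.

```python
import time


class QuorumUnavailable(Exception):
    """Raised by the (hypothetical) write callable when no quorum is reachable."""


def write_with_deadline(write, deadline_s=10.0, backoff_s=0.1,
                        clock=time.monotonic, sleep=time.sleep):
    """Retry `write` until it succeeds or the deadline would be exceeded.

    `clock` and `sleep` are injectable so the loop can be exercised without
    actually waiting.
    """
    start = clock()
    while True:
        try:
            return write()
        except QuorumUnavailable:
            # Give up if the next backoff would push us past the deadline.
            if clock() - start + backoff_s > deadline_s:
                raise
            sleep(backoff_s)
            backoff_s = min(backoff_s * 2, 1.0)  # exponential backoff, capped at 1s
```

When the deadline passes, the caller is left with the same two choices as before: report a failure to the user, or fall back to a write that isn't guaranteed to be consistent.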

Also if you're accepting quorum writes to the "major partition" you still have to repair the "minor partition" when it comes back online. There's no traditional DB that implements the sort of read-repair/hinted-handoff/anti-entropy mechanics that Cassandra, Voldemort, and the various proprietary big data stores use.

The case where I have to retry indefinitely will only occur when the partition is permanent. And in that case an eventually consistent system will not help either, because clients will not see the writes (as long as the partition exists), so the eventually consistent system doesn't offer me any advantage.
My point was that for an interactive application it doesn't take a very long time for user experience to degrade substantially.

And we've glossed over my points about newer databases exposing these knobs to clients. It's possible to do exactly what you're describing using Cassandra, for example (er, you might have to do some consistency level tweaking to make writes fail if you can't get a quorum of authoritative nodes, but it wouldn't be hard - and I'm not even sure if that's necessary). It's not possible to do it with MySQL or PostgreSQL without building some intelligent partitioning layer on top. And that layer will probably make it impossible to do joins and add relationship constraints, so you lose any benefits these systems bring to the table.
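
To illustrate the kind of knob I mean, here's a toy model in Python of per-request consistency levels. This is not the Cassandra client API; `Level`, `required_acks`, and `write_succeeds` are names invented for the sketch, and it only models counting acknowledgements from replicas.

```python
from enum import Enum


class Level(Enum):
    """Hypothetical per-request consistency levels."""
    ONE = "one"
    QUORUM = "quorum"
    ALL = "all"


def required_acks(level: Level, replicas: int) -> int:
    """Number of replica acknowledgements a write needs at this level."""
    if level is Level.ONE:
        return 1
    if level is Level.QUORUM:
        return replicas // 2 + 1  # strict majority of the replicas
    return replicas


def write_succeeds(level: Level, replicas: int, reachable: int) -> bool:
    """True if enough replicas are reachable to satisfy the requested level."""
    return reachable >= required_acks(level, replicas)
```

With three replicas and only one reachable, a write at `Level.ONE` goes through (and has to be repaired later), while the same write at `Level.QUORUM` fails, which is the behavior described upthread. The client picks which failure mode it wants per request.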