Hacker News new | ask | show | jobs
by aristus 5913 days ago
It's probably to do with differences in scale. At the LAN level, he's right: partitioning is rare and C/A might make the most sense. At the WAN level, partitioning is a fact of life.

There are at least two problems to solve on opposite ends of scale: how to get many components inside one computer to cooperate without stepping all over each other, and how to get many computers to cooperate without drowning in coordination overhead. They may be special cases of a more general problem, and one solution will work for all. Or perhaps we'll have one kind of programming for the large and another for the small, just as the mechanics of life are different inside and outside of the cell.

1 comments

Agreed, but he's addressed WAN partitioning very well. Editing: "[Consider] a partition in a WAN network. There is enough redundancy engineered into today’s WANs that a partition is quite rare ... the most likely WAN failure is to separate a small portion of the network from the majority. In this case, the majority can continue with straightforward algorithms, and only the small portion must block. Hence, it seems unwise to give up consistency all the time in exchange for availability of a small subset of the nodes in a fairly rare scenario."

That last sentence is a very strong put-down of NoSQL; if it wasn't published on the ACM website, it should have a "zing!" at the end of it.

Actually he's not really addressing it at all. What stats does he have that show WAN partitions are rare? I find that short WAN partitions occur regularly. And how does the small portion "block?" How does it even know that it's the "small portion?" If a read comes into the "small portion" (maybe because the client's in the same data center) what does the small portion do? If it responds, the response may be inconsistent. If it doesn't, the data store is unavailable (at least as far as that client is concerned).
Paxos (which you'll probably use to implement a strongly consistent distributed system) allows nodes to reach consensus as long as a majority of them are able to talk to each other. So if a partition includes a majority of nodes, this partition will just continue working like before. The minority partition will not be able to reach any decisions at all and will be unavailable. Strictly speaking, the minority partition doesn't even know it's a minority partition, because this is an undecidable problem in asynchronous distributed systems. But the important point is: At most one partition will continue making decisions so that consistency is guaranteed.

So you're right in that the minority partition will be unavaiable, but I think it's a worthwhile tradeoff.

It _may_ be a worthwhile tradeoff. But if you're in three datacenters and one of them splits you could have a third of your requests fail. Depending on what you're doing, that may be unacceptable. I agree that it's a tradeoff. I disagree with the people who are saying that emphasizing consistency is always the right thing to do.
The problem is: If a node performs a write in an eventually-consistent data-store, the write will not be visible immediately. Alternatively (in a strongly consistent data-store) the node could just retry the write until it succeeds. From the perspective of a user that wouldn't really make much of a difference, but in the latter case there would at least be no risk of having multiple inconsistent versions of the same data.
Your first point is not true. In the absence of partitions, all "eventually consistent" data stores that I know of will give you strong consistency. The eventual consistency bit only comes into play if a problem occurs. (It's probably also worth noting that many RDBMS replication strategies won't give you strong consistency at all - even in ideal circumstances.)

I suppose you could retry if a write fails (e.g., can't reach a quorum), but you could theoretically end up retrying forever (and 10 seconds or so is forever as far as an interactive user is concerned)... eventually you need to either fail or write inconsistently. So we're just delaying the inevitable.

Also if you're accepting quorum writes to the "major partition" you still have to repair the "minor partition" when it comes back online. There's no traditional DB that implements the sort of read-repair/hinted-handoff/anti-entropy mechanics that Cassandra, Voldemort, and the various proprietary big data stores use.

Quorum protocols are the simple mechanism he's talking about. The argument he's making is that you shouldn't be so binary about saying 'well this subset of nodes is unavailable, so the system is not available'. That's why CAP is so overapplied in the real world: just because some fraction of your nodes are offline, that doesn't mean the whole system is offline.

Your sharing your statistics on WAN partitions happening regularly would be a welcome contribution to the debate. There's a hierarchy of failures, and I think it's generally accepted that WAN partitions happen less often than, say, one node in a cluster crashing. Statistics that show otherwise would let us talk in specifics rather than the abstract.

Sure, but quorum protocols only provide strong consistency in the absence of partitions. If a partition occurs you may not be able to get a quorum (where R + W > N) and, again, you're stuck with either being unavailable or potentially inconsistent. There's really no way around it... AFAIK it's a logical impossibility.

I'm not sure I get your argument re: CAP being overapplied. The key point the whole "AP" camp is making is exactly what you're saying - "just because some fraction of your nodes are offline, that doesn't mean the whole system is offline." What it does mean, though, is that some of your data may be stale. But eventually it won't be.

As for WAN partitions, I agree, they're not as frequent as single node failures. But as far as CAP is concerned it doesn't really matter. A partition is a partition, whether it's one node or half your cluster. The frequency that "WAN partitions" occurs depends on how you define a "WAN partition." If you consider a single lost TCP connection a short-lived partition (it pretty much is), or if you consider a DNS or power outage a WAN partition (in the sense that a whole cluster might disappear) then I think we can all come up with lots of ways WAN partitions can and do occur. I do agree that the entire Internet doesn't go down very often.

You choose your quorum trading off cost/complexity vs risk-tolerance. You ensure that not forming a quorum is impossible in scenarios that you care about. e.g. You may decide it's OK not to form a quorum if the entire USA power grid goes offline.

The broad problem is that you're trying to apply the mathematical proof of the CAP theorem to the real world. For example, the proof of the CAP theorem treats single-node failures as a case of network partitioning, which is logically elegant. But in the real world, it's just not realistic to consider a dropped TCP connection as equivalent to the failure of a datacenter, as you seem to be doing.

Er, no. I'm just not differentiating between the various reasons why a single node may be unavailable. It doesn't really matter _why_ the node is unavailable... it just is.

FWIW, databases like Cassandra expose the consistency tradeoff to the client. You can do quorum reads/writes with Cassandra. You can't with MySQL or PostgreSQL.

Edit: you can choose between quorum reads/writes and stronger or weaker consistency levels with Cassandra, but can't with MySQL / PostgreSQL.