| HN Mirror



	by justinsb 5913 days ago
	Quorum protocols are the simple mechanism he's talking about. His argument is that you shouldn't be so binary as to say 'well, this subset of nodes is unavailable, so the system is not available'. That's why CAP is so overapplied in the real world: just because some fraction of your nodes are offline doesn't mean the whole system is offline. If you have statistics showing that WAN partitions happen regularly, sharing them would be a welcome contribution to the debate. There's a hierarchy of failures, and I think it's generally accepted that WAN partitions happen less often than, say, one node in a cluster crashing. Statistics that show otherwise would let us talk in specifics rather than in the abstract.


mmalone 5913 days ago

Sure, but quorum protocols only provide strong consistency in the absence of partitions. If a partition occurs you may not be able to get a quorum (where R + W > N) and, again, you're stuck with either being unavailable or potentially inconsistent. There's really no way around it... AFAIK it's a logical impossibility.
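To make the R + W > N point concrete, here is a minimal sketch in plain Python. It is not tied to any real database; the function names `quorums_overlap` and `can_serve` and the 5-replica numbers are assumptions picked for illustration. It checks by brute force that read and write quorums always share a replica when R + W > N, and then shows what each side of a partition can still do.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """Return True if every read set of size r shares at least one
    replica with every write set of size w, out of n replicas."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


def can_serve(reachable, r, w):
    """Which quorum operations a client can complete when it can
    reach only `reachable` replicas (hypothetical helper)."""
    return {"read": reachable >= r, "write": reachable >= w}


N, R, W = 5, 3, 3  # illustrative values, R + W > N

print(quorums_overlap(N, R, W))  # R + W > N: quorums always intersect
print(quorums_overlap(N, 2, 3))  # R + W == N: disjoint quorums are possible

# A partition splits the 5 replicas into groups of 3 and 2.
print(can_serve(3, R, W))  # majority side can still form both quorums
print(can_serve(2, R, W))  # minority side cannot form either quorum
```

The minority side is where the choice shows up: refuse the request (unavailable) or answer from fewer than R replicas (potentially stale).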

I'm not sure I get your argument re: CAP being overapplied. The key point the whole "AP" camp is making is exactly what you're saying - "just because some fraction of your nodes are offline, that doesn't mean the whole system is offline." What it does mean, though, is that some of your data may be stale. But eventually it won't be.

As for WAN partitions, I agree, they're not as frequent as single node failures. But as far as CAP is concerned it doesn't really matter. A partition is a partition, whether it's one node or half your cluster. How often "WAN partitions" occur depends on how you define a "WAN partition." If you consider a single lost TCP connection a short-lived partition (it pretty much is), or if you consider a DNS or power outage a WAN partition (in the sense that a whole cluster might disappear), then I think we can all come up with lots of ways WAN partitions can and do occur. I do agree that the entire Internet doesn't go down very often.


justinsb 5913 days ago

You choose your quorum by trading off cost/complexity against risk tolerance. You make sure a quorum can always be formed in the scenarios that you care about, e.g. you may decide it's OK not to form a quorum if the entire USA power grid goes offline.

The broad problem is that you're trying to apply the mathematical proof of the CAP theorem to the real world. For example, the proof of the CAP theorem treats single-node failures as a case of network partitioning, which is logically elegant. But in the real world, it's just not realistic to consider a dropped TCP connection as equivalent to the failure of a datacenter, as you seem to be doing.


mmalone 5913 days ago

Er, no. I'm just not differentiating between the various reasons why a single node may be unavailable. It doesn't really matter _why_ the node is unavailable... it just is.

FWIW, databases like Cassandra expose the consistency tradeoff to the client. You can do quorum reads/writes with Cassandra. You can't with MySQL or PostgreSQL.

Edit: you can choose between quorum reads/writes and stronger or weaker consistency levels with Cassandra, but can't with MySQL / PostgreSQL.
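A rough sketch of what that per-request choice means, in plain Python. ONE, QUORUM and ALL are real Cassandra consistency levels, and QUORUM is a majority of the replicas; everything else here (the function names, the replication factor of 3) is an assumption for illustration, and the code never talks to a Cassandra cluster.

```python
def required_acks(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # majority of replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


def read_sees_latest_write(read_level, write_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > N)."""
    r = required_acks(read_level, replication_factor)
    w = required_acks(write_level, replication_factor)
    return r + w > replication_factor


RF = 3  # illustrative replication factor

for read_level, write_level in [("QUORUM", "QUORUM"), ("ONE", "ALL"), ("ONE", "ONE")]:
    print(read_level, write_level, read_sees_latest_write(read_level, write_level, RF))
```

The weaker combinations need fewer replicas to be reachable, so they stay available through more failures, at the cost of possibly stale reads.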


justinsb 5913 days ago

I'm not going to treat a cosmic ray corrupting one single network packet the same way I treat a hurricane cutting off power to a datacenter for 2 weeks. I do see the intellectual appeal in doing so, but we'll just have to agree to disagree!


mmalone 5913 days ago

Uhm, of course they're not the same thing... but they have the same effect. The point is that the system remains available even if a node becomes unavailable for _whatever_ reason. I'm not sure what you're disagreeing on... There are common and uncommon modes of failure. Of course we should prioritize handling the common ones. But if we can handle all of them at once that's ideal. And, as I said in an earlier comment, when you're doing a million operations a second, failures that are one-in-a-million happen every second.
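The arithmetic behind the one-in-a-million remark, as a small sketch. It assumes failures are independent and equally likely per operation, which real systems won't match exactly; the function names are made up for illustration.

```python
def expected_failures_per_second(ops_per_second, failure_probability):
    """Average number of failed operations per second."""
    return ops_per_second * failure_probability


def prob_at_least_one_failure(ops_per_second, failure_probability):
    """Chance that a given second contains at least one failure,
    assuming each operation fails independently."""
    return 1.0 - (1.0 - failure_probability) ** ops_per_second


ops = 1_000_000  # a million operations a second
p = 1e-6         # a one-in-a-million failure

print(expected_failures_per_second(ops, p))  # about 1 failure per second on average
print(prob_at_least_one_failure(ops, p))     # roughly 0.63, i.e. close to 1 - 1/e
```

So "every second" holds on average: about one failure per second, with roughly a 63% chance that any particular second contains at least one.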
