by justinsb
5913 days ago
Quorum protocols are the simple mechanism he's talking about. The argument he's making is that you shouldn't be so binary about saying 'well, this subset of nodes is unavailable, so the system is not available'. That's why CAP is so overapplied in the real world: just because some fraction of your nodes are offline, that doesn't mean the whole system is offline. If you have statistics showing that WAN partitions happen regularly, sharing them would be a welcome contribution to the debate. There's a hierarchy of failures, and I think it's generally accepted that WAN partitions happen less often than, say, one node in a cluster crashing. Statistics that show otherwise would let us talk in specifics rather than in the abstract.
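To make the quorum point concrete, here is a minimal sketch, assuming a store with N replicas where a read needs R reachable replicas and a write needs W. The function names and the 5-node numbers are made up for illustration; they don't come from any particular system.

```python
def is_strict_quorum(n: int, r: int, w: int) -> bool:
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


def availability(reachable: int, r: int, w: int) -> tuple[bool, bool]:
    """Return (can_read, can_write) for a client that can reach `reachable` replicas."""
    return reachable >= r, reachable >= w


# Hypothetical 5-replica cluster using majority quorums.
N, R, W = 5, 3, 3

# Two replicas crash: the remaining three still form a quorum.
after_crashes = availability(reachable=3, r=R, w=W)

# A partition splits the cluster 3/2: only the majority side keeps serving.
majority_side = availability(reachable=3, r=R, w=W)
minority_side = availability(reachable=2, r=R, w=W)
```

With these numbers, losing two of five nodes leaves the system fully usable, and in a 3/2 split only clients stuck on the two-node side lose service. Lowering R and W so that R + W <= N would keep the minority side answering too, at the cost of possibly stale reads.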
I'm not sure I get your argument re: CAP being overapplied. The key point the whole "AP" camp is making is exactly what you're saying - "just because some fraction of your nodes are offline, that doesn't mean the whole system is offline." What it does mean, though, is that some of your data may be stale. But eventually it won't be.
As for WAN partitions, I agree, they're not as frequent as single node failures. But as far as CAP is concerned it doesn't really matter. A partition is a partition, whether it's one node or half your cluster. How often "WAN partitions" occur depends on how you define a "WAN partition." If you consider a single lost TCP connection a short-lived partition (it pretty much is), or if you consider a DNS or power outage a WAN partition (in the sense that a whole cluster might disappear), then I think we can all come up with lots of ways WAN partitions can and do occur. I do agree that the entire Internet doesn't go down very often.