Hacker News
by ori_b 3596 days ago
The problem is that half the servers failing is indistinguishable from a switch connecting two racks being flaky, or anything else which can lead to half the servers being temporarily disconnected from the others.
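To make that concrete, here is a minimal sketch of what one server can actually observe. The six-node, two-rack layout and the names `RACK_A`, `RACK_B`, `full_mesh`, and `visible_peers` are assumptions made up for illustration, not taken from any real system.

```python
from itertools import combinations

# Hypothetical layout: six servers split evenly across two racks.
RACK_A = {0, 1, 2}
RACK_B = {3, 4, 5}
ALL_NODES = RACK_A | RACK_B


def full_mesh(nodes):
    """Return every bidirectional link between the given nodes."""
    return {frozenset(pair) for pair in combinations(nodes, 2)}


def visible_peers(node, alive, links):
    """Peers that `node` can still exchange heartbeats with."""
    return {p for p in alive if p != node and frozenset((node, p)) in links}


# Scenario 1: every server in rack B has crashed; the network is healthy.
crash_view = visible_peers(0, alive=RACK_A, links=full_mesh(ALL_NODES))

# Scenario 2: all servers are up, but the inter-rack switch is down,
# so only links within each rack survive.
partition_view = visible_peers(
    0, alive=ALL_NODES, links=full_mesh(RACK_A) | full_mesh(RACK_B)
)

# From node 0's point of view the two scenarios look identical.
same_view = crash_view == partition_view
```

In both scenarios node 0 sees only its own rack, so nothing it can measure locally tells it which failure happened.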
2 comments

Classical consensus only solves the problem for up to 33% failure (3f+1 nodes, with f failures). Having half of your servers fail cannot be handled with Paxos.
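For reference, a rough sketch of the arithmetic. The 3f+1 bound is the one usually quoted for Byzantine faults, while crash-fault protocols such as Paxos are usually described as needing 2f+1 nodes to survive f failures. Either way, losing exactly half the cluster is out of reach. The function names below are made up for illustration.

```python
def max_crash_faults(n):
    """Largest f with n >= 2f + 1 (crash-fault model, e.g. Paxos)."""
    return (n - 1) // 2


def max_byzantine_faults(n):
    """Largest f with n >= 3f + 1 (Byzantine-fault model)."""
    return (n - 1) // 3


def majority_quorum(n):
    """Smallest group size that is a strict majority of n nodes."""
    return n // 2 + 1


def can_make_progress(n, reachable):
    """A side of a split can proceed only if it holds a strict majority."""
    return reachable >= majority_quorum(n)


# Six servers split 3/3: neither half reaches the quorum of 4.
six_node_quorum = majority_quorum(6)
half_can_proceed = can_make_progress(6, 3)
```

With six servers, a 3/3 split leaves both sides below the quorum of 4, so neither side can safely make progress until connectivity returns.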
Why wouldn't you use redundant switches rather than n(2) servers?
Why would I buy two switches when I can solve the problem in general with a consensus algorithm that handles hosts going away for any reason?

Especially since a flaky switch isn't the only issue. Power loss could bring down even redundant backbone switches, but leave communication within the rack going. You could accidentally push a bad routing config and blackhole traffic to a bunch of hosts. You could send out a bad push that intermittently breaks connectivity to some hosts. And so on.