| HN Mirror



	by ori_b 3596 days ago
	The problem is that half the servers failing is indistinguishable from a switch connecting two racks being flaky, or anything else which can lead to half the servers being temporarily disconnected from the others.

2 comments

ketzu 3595 days ago

Classical consensus only solves the problem for up to 33% failure (3f+1 nodes, with f failures); having half of your servers fail cannot be handled with Paxos.
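A rough sketch of the arithmetic behind that bound. Note the 3f+1 figure is the usual requirement for Byzantine (arbitrarily misbehaving) nodes; for plain crash failures, majority-quorum protocols such as Paxos are normally stated as needing 2f+1 nodes. Either way, losing half the cluster is out of reach. The function names here are made up for illustration.

```python
def max_crash_faults(n: int) -> int:
    """Largest f such that n >= 2f + 1 (crash faults, majority quorums)."""
    return (n - 1) // 2


def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1 (Byzantine faults)."""
    return (n - 1) // 3


if __name__ == "__main__":
    for n in (3, 4, 7, 10):
        print(
            f"n={n}: tolerates {max_crash_faults(n)} crash faults, "
            f"{max_byzantine_faults(n)} Byzantine faults, "
            f"half the cluster would be {n / 2:g}"
        )
```

For every cluster size the tolerated count stays strictly below n/2, which is the point being made above: no setting of n lets a majority-based protocol survive exactly half the nodes disappearing.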


f0urtyfive 3596 days ago

Why wouldn't you use redundant switches rather than n(2) servers?


ori_b 3595 days ago

Why would I buy two switches when I can solve the problem in general with a consensus algorithm that handles hosts going away for any reason?

Especially since a flaky switch isn't the only issue. Power loss could bring down even redundant backbone switches, but leave communication within the rack going. You could accidentally push a bad routing config and blackhole traffic to a bunch of hosts. You could send out a bad push that intermittently breaks connectivity to some hosts. And so on.
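To make the "for any reason" part concrete, here is a minimal sketch, assuming a simple majority-quorum rule and hypothetical host names. A host cannot tell whether its peers crashed or merely became unreachable, so all it can do is count who it can still talk to. Whatever caused the split (switch, power, bad routing push), at most one side can hold a majority.

```python
from typing import FrozenSet, Iterable, Optional


def side_with_quorum(
    total_hosts: int, sides: Iterable[FrozenSet[str]]
) -> Optional[FrozenSet[str]]:
    """Return the group of mutually reachable hosts that holds a strict
    majority of the cluster, or None if no group does."""
    quorum = total_hosts // 2 + 1
    for side in sides:
        if len(side) >= quorum:
            return side
    return None


if __name__ == "__main__":
    # Hypothetical layout: the backbone is down, each rack can still talk internally.
    rack_a = frozenset({"a1", "a2", "a3"})
    rack_b = frozenset({"b1", "b2"})
    print(side_with_quorum(5, [rack_a, rack_b]))  # rack_a keeps working

    # An even split: neither rack has a majority, so nobody makes progress.
    rack_c = frozenset({"c1", "c2", "c3"})
    print(side_with_quorum(6, [rack_a, rack_c]))  # None
```

The even-split case is the one from the top of the thread: with half the hosts unreachable, the safe answer is to stop, because the other half might be alive and making the same decision.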
