


	by f0urtyfive 3596 days ago
	Once you're planning workarounds for the "assume half your servers fail" scenario, I think it's time to admit that eventually something is going to go down that is not automatically recoverable.


ori_b 3596 days ago

The problem is that half the servers failing is indistinguishable from a switch connecting two racks being flaky, or anything else which can lead to half the servers being temporarily disconnected from the others.


ketzu 3595 days ago

Classical Byzantine consensus only tolerates just under a third of the nodes failing (3f+1 nodes for f failures), and even crash-fault Paxos needs a majority alive (2f+1 nodes for f failures). Having half of your servers fail cannot be handled with Paxos.
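
To make the arithmetic concrete, here is a rough sketch assuming the standard bounds: n >= 2f+1 for crash-fault majority-quorum protocols such as Paxos, and n >= 3f+1 for Byzantine fault tolerance. The function names are made up for illustration and do not come from any library.

```python
def max_crash_faults(n: int) -> int:
    """Largest f with n >= 2f + 1 (majority-quorum protocols such as Paxos)."""
    return (n - 1) // 2


def max_byzantine_faults(n: int) -> int:
    """Largest f with n >= 3f + 1 (Byzantine fault tolerant protocols)."""
    return (n - 1) // 3


def has_majority(reachable: int, n: int) -> bool:
    """True if `reachable` nodes form a strict majority of an n-node cluster."""
    return 2 * reachable > n


if __name__ == "__main__":
    # For each cluster size, show the tolerated failures and whether
    # exactly half the nodes (rounded down) could still make progress.
    for n in (3, 4, 5, 10):
        print(
            f"n={n}: crash f={max_crash_faults(n)}, "
            f"byzantine f={max_byzantine_faults(n)}, "
            f"half can proceed={has_majority(n // 2, n)}"
        )
```

In every case the half-sized group lacks a strict majority, so after an even split neither side can commit anything until connectivity returns.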


f0urtyfive 3595 days ago

Why wouldn't you use redundant switches rather than n(2) servers?


ori_b 3595 days ago

Why would I buy two switches when I can solve the problem in general with a consensus algorithm that handles hosts going away for any reason?

Especially since a flaky switch isn't the only issue. Power loss could bring down even redundant backbone switches, but leave communication within the rack going. You could accidentally push a bad routing config and blackhole traffic to a bunch of hosts. You could send out a bad push that intermittently breaks connectivity to some hosts. And so on.
