Hacker News
by srcreigh 1213 days ago
Does this practically improve the situation? The odds of two servers breaking at the same time for the same reason seem very high. I actually can't think of a single example where the secondary server would keep running.

Regression via a code or dependency update? Full disk? DNS is down? Too much load? All of these would bring down both servers in quick succession.

I guess something like a "once every 2 days" race condition could buy you some time if you had a 2nd server. But that's not a common error.
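
To put rough numbers on that argument, here is a minimal sketch of how much a second server helps depending on how much of the downtime comes from shared causes. The function name `pair_availability`, the `shared_fraction` parameter, and the simple probability model are assumptions made for illustration, not measurements from any real service.

```python
def pair_availability(single: float, shared_fraction: float) -> float:
    """Estimate the availability of a two-server pair.

    single: availability of one server on its own (e.g. 0.99).
    shared_fraction: hypothetical share of that server's downtime caused by
        faults that hit both servers at once (bad deploy, DNS outage, overload).
    """
    downtime = 1.0 - single
    shared = downtime * shared_fraction                # takes out both servers
    independent = downtime * (1.0 - shared_fraction)   # hits one server at a time
    # The pair is down during a shared fault, or when both servers happen to
    # be down independently at the same moment.
    pair_downtime = shared + (1.0 - shared) * independent ** 2
    return 1.0 - pair_downtime


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, round(pair_availability(0.99, fraction), 6))
```

Under this model, if every outage is a shared one (`shared_fraction=1.0`), the second server adds nothing, and the more failures are independent, the closer the pair gets to 0.9999.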

2 comments

Zero downtime upgrades, hardware faults, AWS deciding that a specific instance needs to die. It also doesn't let you cheat statelessness very easily, so it's easier to scale horizontally.
Fair enough I guess. I don’t think you need two servers to do zero downtime upgrades. And the other issues are, imo, beyond the 0.99 uptime threshold that most services realistically have when you add in breakage due to upgrades.

I like your statelessness point. I suppose in your view it’s better to have the concentrated stateful core with stateless servers as opposed to just one stateful instance. Two instances mean you can’t easily store foo in memory and hope the server doesn’t die until it’s not needed there anymore. Counterpoint is that the extra layer of indirection is 10x slower and horizontal scaling won’t be needed as much if you don’t pay that price in the first place, but you are right, the temptation to store foo in memory would still be in its prime. The thing is, if one machine can scale, putting foo in memory isn’t actually bad. It’s only when things don’t scale that it’s bad.

> I don’t think you need two servers to do zero downtime upgrades

Absolutely not, and I can't understand why I keep hearing this argument. Doing zero downtime upgrades on a single server has been simple since basically forever: run another process on another port, change the config, restart the front balancer gracefully, and there you go.
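
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the port pair, the function names, the nginx-style upstream block, and the health-check step are not from the comment, which only says "front balancer". The code just computes the plan and the config text; it does not start or reload anything.

```python
PORTS = (8001, 8002)  # hypothetical pair of local ports to alternate between


def idle_port(live_port: int, ports: tuple = PORTS) -> int:
    """Return the port that is not currently serving traffic."""
    if live_port not in ports:
        raise ValueError(f"unknown live port: {live_port}")
    return ports[1] if live_port == ports[0] else ports[0]


def render_upstream(port: int, name: str = "app") -> str:
    """Render an nginx-style upstream block pointing at one local port."""
    return f"upstream {name} {{\n    server 127.0.0.1:{port};\n}}\n"


def upgrade_plan(live_port: int) -> list:
    """List the steps of a single-server upgrade, in order."""
    new_port = idle_port(live_port)
    return [
        f"start the new version on port {new_port}",
        f"wait until port {new_port} passes a health check (added assumption)",
        f"write balancer config pointing at port {new_port}",
        "reload the balancer gracefully (for nginx: `nginx -s reload`)",
        f"stop the old version on port {live_port} once its connections drain",
    ]


if __name__ == "__main__":
    print(render_upstream(idle_port(8001)))
    for step in upgrade_plan(8001):
        print("-", step)
```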

Sure, it can be done, but that alone isn't enough reason to give up redundancy.
We use 3-node MSSQL, and it happens all the time that the primary gets into a bad state (100% CPU, high latency, etc.) and simply failing over to another instance fully recovers.

It could be bad hardware, it could be a bad query (left dangling/canceled on the old instance), it could be bad statistics and unlocks disk fragmentation, etc.
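
As a rough illustration of the "fail over when the primary is in a bad state" idea, here is a minimal sketch. It is not how MSSQL itself decides on failover; the `NodeStats` type, the thresholds, and the selection rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    name: str
    cpu_percent: float
    latency_ms: float


def is_healthy(node: NodeStats, max_cpu: float = 95.0,
               max_latency_ms: float = 500.0) -> bool:
    """Treat a node as healthy if it is under both (hypothetical) thresholds."""
    return node.cpu_percent < max_cpu and node.latency_ms < max_latency_ms


def choose_primary(current: str, nodes: list) -> str:
    """Keep the current primary if healthy, else pick the first healthy node."""
    by_name = {node.name: node for node in nodes}
    if current in by_name and is_healthy(by_name[current]):
        return current
    for node in nodes:
        if node.name != current and is_healthy(node):
            return node.name
    return current  # nothing healthier is available, so stay put


if __name__ == "__main__":
    cluster = [
        NodeStats("node-a", cpu_percent=100.0, latency_ms=2000.0),  # bad state
        NodeStats("node-b", cpu_percent=20.0, latency_ms=15.0),
        NodeStats("node-c", cpu_percent=25.0, latency_ms=18.0),
    ]
    print(choose_primary("node-a", cluster))
```

The point of the sketch is that the decision does not need to know why the primary is unhealthy (hardware, a stuck query, stale statistics); it only needs another instance to move to.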