Hacker News
by mechanical_fish 4273 days ago
Rackspace may have thought that they could mitigate the issue without taking this drastic step. They certainly didn't want to do global reboots.

The engineering team probably spent some time running tests and scribbling on whiteboards, trying to prove that the boat wasn't going to sink. In hindsight, they should have just sounded the klaxon and started handing out life jackets, but you know what they say about hindsight. And there are lots of reasons why the typical engineering organization struggles to accept the inevitable and call for an evacuation. Nobody likes Cassandra. Everybody wants to be a hero. Didn't you say this boat was unsinkable? It's hard to get all the decision-makers into one room. The show must go on. It isn't obvious that this complicated problem leads to our certain doom. Et cetera.

The key to making these things go smoothly is the Chaos Monkey, a.k.a. "conduct constant drills of your emergency responses". If you don't rehearse a response, you shy away from using it when it counts. AWS halts or reboots EC2 instances all the time, and lo and behold, when it comes time to reboot all EC2 instances they don't flinch. Or they flinch less visibly, anyway.
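
To make the drill idea concrete, here is a rough sketch in Python. It is not Netflix's actual Chaos Monkey and not anything AWS is known to run; the names `run_drill`, `reboot`, and `fraction` are made up for illustration, and `reboot` stands in for whatever your platform uses to restart a host.

```python
import random


def run_drill(instances, reboot, fraction=0.1, rng=None):
    """Reboot a random sample of instances as a routine drill.

    instances: iterable of instance names (hypothetical identifiers).
    reboot:    callable that restarts one instance; supplied by the caller,
               since the real mechanism is platform-specific (assumption).
    fraction:  share of the fleet to hit per drill; at least one instance
               is always picked when the fleet is non-empty.
    rng:       optional random.Random, so a drill can be made repeatable.
    """
    fleet = list(instances)
    if not fleet:
        return []
    rng = rng or random.Random()
    count = max(1, int(len(fleet) * fraction))
    victims = rng.sample(fleet, count)  # sample without replacement
    for name in victims:
        reboot(name)
    return victims


if __name__ == "__main__":
    fleet = [f"web-{i}" for i in range(20)]
    log = []
    # Record the reboots instead of touching real machines.
    run_drill(fleet, log.append, fraction=0.25)
    print("rebooted in this drill:", log)
```

Run something like this on a schedule and the reboot path gets exercised every week instead of once per crisis. The injected `reboot` callable is a deliberate choice: it lets you rehearse the drill itself against a fake before pointing it at production.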