Hacker News new | ask | show | jobs
by imbriaco 3795 days ago
Google "Netflix downtime" for evidence that Netflix also has outages. Google has outages, sometimes very significant ones of Google Apps. Facebook has outages.

Complex systems fail. Period. All the time. Things like the Simian Army are fantastic tools that help you identify a host of problems and remediate them in advance, but they cannot test every combinatorial possibility in a complex distributed system.

At the end of the day, the best defense is to have skilled people who are practiced at responding to problems. GitHub has those in spades, which is why they could respond to a widespread failure of their physical layer in just over 2 hours.

The biggest win with the Simian Army isn't that it improves your redundancy. It's that it gives your people opportunities to _practice_ responses.

2 comments

More than practicing responses, Chaos Monkey and Failure Injection Testing allow us to verify that we don't have unexpected hard dependencies. Sometimes you find out that your service can't start if another one becomes latent, in which case you can plan for it by adding redundancy/extra capacity, fallbacks or working in degraded mode.
I remember in 2013 a full-day outage of Google.