by pedalpete 3787 days ago
Does GitHub run anything like Netflix's Simbian Army against its services? As a company by engineers for engineers, at the scale GitHub has reached, I'm a bit surprised they don't have more redundancy. They may not need Netflix's uptime, but an outage of more than a few minutes on GitHub could affect businesses that rely on the service.
2 comments

Google "Netflix downtime" for evidence that Netflix also has outages. Google has outages too, sometimes very significant ones affecting Google Apps. Facebook has outages.

Complex systems fail. Period. All the time. Things like the Simian Army are fantastic tools that help you identify a host of problems and remediate them in advance, but they cannot test every combinatorial possibility in a complex distributed system.

At the end of the day, the best defense is to have skilled people who are practiced at responding to problems. GitHub has those in spades, which is why they could respond to a widespread failure of their physical layer in just over 2 hours.

The biggest win with the Simian Army isn't that it improves your redundancy. It's that it gives your people opportunities to _practice_ responses.

More than practicing responses, Chaos Monkey and Failure Injection Testing allow us to verify that we don't have unexpected hard dependencies. Sometimes you find out that your service can't start if another one becomes latent, in which case you can plan for it by adding redundancy/extra capacity, fallbacks or working in degraded mode.
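To make that concrete, here is a minimal Python sketch of the idea, not Netflix's actual tooling: wrap a dependency so it becomes latent, then check that the caller falls back to a degraded response instead of hanging. The names `inject_latency`, `call_with_fallback` and `recommendations` are hypothetical and exist only for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def inject_latency(fn, delay_s):
    """Wrap fn so every call sleeps delay_s first (a simulated latent dependency)."""
    def slow(*args, **kwargs):
        time.sleep(delay_s)
        return fn(*args, **kwargs)
    return slow


def call_with_fallback(dependency, fallback_value, timeout_s):
    """Call dependency with a deadline.

    Returns (value, degraded); degraded is True when the fallback was used.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(dependency)
        return future.result(timeout=timeout_s), False
    except Exception:
        # Covers both the timeout raised by future.result and errors
        # raised inside the dependency itself.
        return fallback_value, True
    finally:
        # Do not block the caller waiting for the slow call to finish.
        pool.shutdown(wait=False)


def recommendations():
    """Hypothetical downstream service."""
    return ["personalized-1", "personalized-2"]


if __name__ == "__main__":
    latent = inject_latency(recommendations, delay_s=0.3)
    print(call_with_fallback(recommendations, ["popular-1"], timeout_s=1.0))
    print(call_with_fallback(latent, ["popular-1"], timeout_s=0.05))
```

If the second call hung or raised instead of returning the fallback, you would have found a hard dependency before production did.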
I remember in 2013 a full-day outage of Google.
It's "simian army". A simbian army is like a herd of dildos to sit on. I doubt that would have helped github's services recover faster.
You're thinking of sybian army.

I'm really tempted to continue with "a simbian army is actually" but this isn't Reddit so end of comment thread.