|
|
|
|
|
by imbriaco
3795 days ago
|
|
Google "Netflix downtime" for evidence that Netflix also has outages. Google has outages, sometimes very significant ones of Google Apps. Facebook has outages. Complex systems fail. Period. All the time. Things like the Simian Army are fantastic tools that help you identify a host of problems and remediate them in advance, but they cannot test every combinatorial possibility in a complex distributed system. At the end of the day, the best defense is to have skilled people who are practiced at responding to problems. GitHub has those in spades, which is why they could respond to a widespread failure of their physical layer in just over 2 hours. The biggest win with the Simian Army isn't that it improves your redundancy. It's that it gives your people opportunities to _practice_ responses. |
|