by james_cowling 3700 days ago
This is a really great question, since the vast majority of the work went into ensuring correctness and reliability: everything from testing discipline to fault injection to auditing. This also included hardware testing, like pulling out circuit breakers to test our power distribution, or overheating a rack to test graceful shutdown.

I'll give a slightly lazy answer here, however, and point you to a talk I gave about building durable systems, which covers a lot of this material: https://www.oreilly.com/events/velocity/devops-web-performan...