by detaro 3793 days ago
Stuff happens, and even if you test all kinds of things, real failure situations can always play out differently, with partial failures and so on. It just takes one important subsystem hitting an unforeseen edge case, and in many cases going down completely is better than risking running in a state that destroys data or does other bad things. The same goes for taking your time to come back online.

The cases that work are not the ones you hear about. Best practices and testing reduce the risk of making the news, but can't guarantee success.