Hacker News
by WilTimSon 2161 days ago
Reporting accidents isn't the issue. It's having them on the regular and being past the standard operational period that worries me. Not sure why that's contentious.
1 comments

In an ideal world, you have your threshold for "incident" set low enough that you have a lot of them. Catching more of the distribution of problems lets you increase the level of safety.
Exactly. This is the difference between "we're only going to update the status page if a customer comes screaming to us that the service has been unavailable for the past 24 hours and they've received no updates" and "we have automated systems in place to update the status page if the p95 response time exceeds XXX (among other things)".
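To make the second approach concrete, here's a rough sketch of what that kind of check might look like. Everything in it is an assumption for illustration: the function names, the status strings, and the threshold value are made up, and a real setup would pull samples from a metrics system and push to whatever status page you actually use.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty collection of response times."""
    ordered = sorted(samples)
    # 1-based nearest-rank index, ceil(0.95 * n), done in integer arithmetic
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def status_for(samples_ms, threshold_ms):
    """Return 'degraded' when p95 exceeds the threshold, else 'operational'.

    threshold_ms is hypothetical; the right value depends on the service.
    """
    return "degraded" if p95(samples_ms) > threshold_ms else "operational"


if __name__ == "__main__":
    recent = [120, 135, 140, 150, 155, 160, 170, 180, 900, 950]  # made-up samples, ms
    print(status_for(recent, threshold_ms=500))
```

Run something like this on a schedule over a sliding window of recent requests and the status page flips without anyone having to notice first. Setting the threshold low is what gets you the "lots of small incidents" behaviour described above.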