by snewman
3808 days ago
Minimizing 2 AM issues and maintaining an on-call rotation aren't contradictory. There's no substitute for having someone on call, but you can minimize the number of times they actually get called. This topic is near and dear to my heart. You talk about testing; that's one side of the coin. The other side is careful alert tuning: (a) to minimize false positives at 2 AM, and (b) to catch incipient issues while it's still business hours. (It's useful to think of alerts as just another phase of QA, the one that occurs after you hit production. The sooner you notice a problem, the less damage it causes, to both your customers and your sleep schedule.) At my workplace, we run a fairly complex system, but we've been able to keep nighttime pager incidents down to, I think, less than once per quarter, including false alarms. I can't remember the last one. The QA effort isn't overwhelming, either. See http://blog.scalyr.com/2014/08/99-99-uptime-9-5-schedule/ if you're interested.
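For anyone who wants the (a)/(b) split made concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function names, the 9-to-5 weekday window, and the two-threshold scheme are hypothetical and not taken from the linked post. The idea is that a low "warn" threshold catches incipient problems but only notifies during business hours, while a high "page" threshold is the only thing allowed to wake someone up.

```python
from datetime import datetime, time

# Assumed business hours; adjust to your own team's schedule.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def is_business_hours(now: datetime) -> bool:
    """True on Monday-Friday between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route_alert(value: float, warn_at: float, page_at: float, now: datetime) -> str:
    """Decide what to do with a metric reading (hypothetical routing policy).

    - "page":   value crossed the high threshold; wake the on-call person.
    - "notify": value crossed the low threshold during business hours.
    - "defer":  value crossed the low threshold off-hours; hold until morning.
    - "ok":     nothing to report.
    """
    if value >= page_at:
        return "page"
    if value >= warn_at:
        return "notify" if is_business_hours(now) else "defer"
    return "ok"
```

With a disk-usage metric, for example, `route_alert(0.75, warn_at=0.7, page_at=0.9, now=...)` would defer at 2 AM and notify at 2 PM on a weekday, so the early warning gets handled before it ever becomes a page.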