|
> I can’t tell you how many times I caught an issue because I knew our metrics backwards and forwards, but it didn’t trip an alert threshold. So how many times was an issue missed because you weren't in the office, or because you were looking at your own screen and not dashboards at the moment? Humans are incredibly powerful, but our whole job as SREs is to make things reliable, repeatable, and scalable. We're doing an industry-wide migration from elegantly hand-crafted LAMP stacks running SSH to Kubernetes and infrastructure-as-code, not because you can't fix problems with SSH (you can, and you can usually fix them faster and better) but because you can't scalably fix problems with SSH. Similarly, if a human found an issue and an alert didn't trip, I'd count that as a bug/missing feature in the monitoring. It's valuable while you're still small and working out your monitoring to keep a human in the loop - but at some point you need to get rid of that single point of failure. By all means, rely on a human to figure out where your alerting is lacking (just like you rely on a human to write the infrastructure-as-code), but you should eventually not rely on human intervention to actually keep incidents from happening. |
Instrumentation and alerts are vital - they leverage inhuman persistence, patience and low cost. But alerts do not substitute for a deep understanding of how your systems work.
A number of the more useful "pre-crime" alerts we have are derived from that understanding: if I hadn't been elbow-deep in our systems long enough to notice that certain behaviors have non-obvious second- and third-order effects downstream, we wouldn't have those alerts at all.