| Strongly disagree. Understanding your metrics is a key part of so many roles, from devops, to product teams, to marketers... Yes, you should be automating alerts whenever possible. Yes, you should be putting up key metrics in a visible place so everyone can see how the product is performing. I can’t tell you how many times I caught an issue because I knew our metrics backwards and forwards, but it didn’t trip an alert threshold. Not every issue follows a pattern easily defined in a check, and human brains are incredible computers capable of helping to fill in that gap. |
So how many times was an issue missed because you weren't in the office, or because you were looking at your own screen and not dashboards at the moment?
Humans are incredibly powerful, but our whole job as SREs is to make things reliable, repeatable, and scalable. We're doing an industry-wide migration from elegantly hand-crafted LAMP stacks running SSH to Kubernetes and infrastructure-as-code, not because you can't fix problems with SSH (you can, and you can usually fix them faster and better), but because you can't scalably fix problems with SSH. Similarly, if a human found an issue and an alert didn't trip, I'd count that as a bug/missing feature in the monitoring.
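To make that concrete, here's a rough sketch of what closing that kind of gap might look like: the thing a person notices ("that's way off from normal, even though it's under the paging limit") turned into a check. The function names, the sample numbers, the 5% limit, and the 3-sigma cutoff are all made up for illustration, and a real setup would more likely express this in whatever monitoring system you already run.

```python
from statistics import mean, stdev


def static_threshold_alert(value, limit):
    """Classic check: fire only when the value crosses a fixed limit."""
    return value > limit


def baseline_deviation_alert(history, value, max_sigma=3.0):
    """Fire when the value strays too far from its own recent baseline."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # perfectly flat history: any change is unusual
    return abs(value - mu) / sigma > max_sigma


# Hypothetical error-rate samples (percent) and a hypothetical 5% paging limit.
recent_error_rates = [0.40, 0.50, 0.45, 0.50, 0.42, 0.48]
current = 1.5

print(static_threshold_alert(current, 5.0))                   # fixed limit
print(baseline_deviation_alert(recent_error_rates, current))  # vs. baseline
```

With these made-up numbers the fixed limit stays quiet while the baseline check fires, which is the "I knew the metrics and something looked off" case written down once instead of depending on someone watching.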
While you're still small and working out your monitoring, it's valuable to keep a human in the loop - but at some point you need to get rid of that single point of failure. By all means, rely on a human to figure out where your alerting is lacking (just like you rely on a human to write the infrastructure-as-code), but eventually you shouldn't rely on human intervention to actually keep incidents from happening.