by geofft
2291 days ago
So, I'm making a bit of a subtle claim - you should absolutely be elbow-deep in your systems, and you should be understanding things well enough to build these sorts of proactive alerts, but you shouldn't rely on people being elbow-deep for noticing problems in real time. If you're ever at the point where you catch a problem and automated monitoring didn't, that's a bug in automated monitoring. If you are really good at finding new bugs in automated monitoring and more things to monitor because you're spending your time getting a sense of how the system behaves, that's fantastic, keep doing that. (That is one of the good reasons for dashboards IMO - a bunch of data to look at when you've already realized something's wrong. Just don't use dashboards to make the decision that something must be wrong.) If you don't improve your automated monitoring and you're worried things will start failing without humans watching dashboards, then you're not solving your existing bugs.
|
I completely and unreservedly agree.
> that's a bug in automated monitoring
As part of incident review, we explicitly added a "review monitor performance" step. My favorite part is that the number of times monitors are created, adjusted or complained about post-incident is in itself a highly useful datapoint.
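
For anyone who wants to track that datapoint, here is a rough sketch of how the tally could look. Everything in it is hypothetical: the `MonitorAction` record, the three action labels, and the monitor and incident names are made up for illustration and are not taken from any real tool or from our actual process.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorAction:
    """One monitor-related outcome noted during an incident review (hypothetical schema)."""
    incident_id: str
    monitor: str
    action: str  # assumed labels: "created", "adjusted", or "complained"


def tally_monitor_actions(actions):
    """Count review outcomes by kind of action and by monitor name."""
    by_kind = Counter(a.action for a in actions)
    by_monitor = Counter(a.monitor for a in actions)
    return by_kind, by_monitor


# Made-up review notes from three incidents.
actions = [
    MonitorAction("inc-1", "disk_full", "created"),
    MonitorAction("inc-1", "api_latency", "adjusted"),
    MonitorAction("inc-2", "api_latency", "complained"),
    MonitorAction("inc-3", "api_latency", "adjusted"),
]

by_kind, by_monitor = tally_monitor_actions(actions)
print(by_kind.most_common())      # how often each kind of action came up
print(by_monitor.most_common(1))  # the monitor that keeps showing up in reviews
```

A monitor that keeps getting adjusted or complained about across incidents is probably the one most in need of a rethink.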