Hacker News
by _jal 2291 days ago
You're both right.

Instrumentation and alerts are vital - they leverage inhuman persistence, patience and low cost. But alerts do not substitute for a deep understanding of how your systems work.

A number of our more useful "pre-crime" alerts derive from exactly that understanding - if I hadn't been elbow-deep in our systems long enough to notice that certain behaviors have non-obvious second- and third-order effects downstream, we wouldn't have those alerts at all.

1 comments

So, I'm making a bit of a subtle claim - you should absolutely be elbow-deep in your systems, and you should understand things well enough to build these sorts of proactive alerts, but you shouldn't rely on people being elbow-deep for noticing problems in real time.

If you're ever at the point where you catch a problem and automated monitoring didn't, that's a bug in automated monitoring. If you are really good at finding new bugs in automated monitoring and more things to monitor because you're spending your time getting a sense of how the system behaves, that's fantastic, keep doing that. (That is one of the good reasons for dashboards IMO - a bunch of data to look at when you've already realized something's wrong. Just don't use dashboards to make the decision that something must be wrong.) If you don't improve your automated monitoring and you're worried things will start failing without humans watching dashboards, then you're not solving your existing bugs.
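
To make the "that's a bug in automated monitoring" rule concrete, here is a rough sketch of how it could be applied mechanically: every incident a human caught first becomes a follow-up item against monitoring. The `Incident` record, the `detected_by` labels and the sample incidents are all made up for illustration, not taken from any real tracker.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    id: str
    detected_by: str  # hypothetical labels: "monitor" or "human"
    summary: str


def monitoring_bugs(incidents):
    """Return one follow-up item per incident that a human caught first."""
    return [
        f"monitoring gap: {inc.id}: {inc.summary}"
        for inc in incidents
        if inc.detected_by == "human"
    ]


# Made-up sample data, purely illustrative.
incidents = [
    Incident("INC-1", "monitor", "disk full on db host"),
    Incident("INC-2", "human", "queue latency creeping up"),
]
gaps = monitoring_bugs(incidents)
print(gaps)
```

The point of the sketch is only that "a human noticed it" gets treated as a defect to file, not as a success story.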

> but you shouldn't rely on people being elbow-deep for noticing problems in real time.

I completely and unreservedly agree.

> that's a bug in automated monitoring

As part of incident review, we explicitly added a "review monitor performance" step. My favorite part is that the number of times monitors are created, adjusted or complained about post-incident is in itself a highly useful datapoint.
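
A minimal sketch of how that datapoint could be tallied, assuming each review writes down which monitors were touched and how. The log format, the action names and the sample rows are hypothetical, not what we actually run.

```python
from collections import Counter

# Hypothetical review log: (incident id, monitor name, action taken in review)
review_log = [
    ("INC-1", "disk_free", "adjusted"),
    ("INC-2", "queue_latency", "created"),
    ("INC-2", "queue_depth", "complained"),
    ("INC-3", "queue_latency", "adjusted"),
]


def monitor_churn(log):
    """Count post-incident monitor changes by action and by monitor."""
    by_action = Counter(action for _, _, action in log)
    by_monitor = Counter(monitor for _, monitor, _ in log)
    return by_action, by_monitor


by_action, by_monitor = monitor_churn(review_log)
print(by_action)
print(by_monitor.most_common(1))
```

A monitor that keeps showing up in `by_monitor` across reviews is a hint that either the monitor or the thing it watches needs a closer look.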