Hacker News new | ask | show | jobs
by pjmorris 2296 days ago
> Similarly, if a human found an issue and alert didn't trip, I'd count that as a bug/missing feature in the monitoring.

The way that I took the GP's point was that humans can find things that haven't yet been automated, while automation can't (at least not yet, but I'd argue it'll take AGI for that.)

1 comments

Yes, I agree with this. But if you're relying on humans to look at dashboards to keep your actual service up in the moment, you're not seriously committing to automating (just like if you SSH to every machine you Terraform to tweak things, you're not really committed to Terraform).

What you should do is rely on automation to detect problems and alert people, and in postmortems, look at graphs and have humans say things like "Hey, this queue kept steadily climbing for three hours before the outage" or "We would have noticed it in this metric but it's so noisy so we can't alert on it" or something. Then you can write more automation (or focus on some prerequisite dev work).

I don't think anyone is arguing that, though. Lots of things humans notice e.g. "we speculatively upped the virtual file system cache and now the service has worse throughput but better high nines response time" is not something you can really build an alert for, and neither is it something you really want an alert for -- but absolutely something that would show up on a dashboard you're intimate with.

In other words, people are not arguing replacing alerts with humans, but rather arguing that continuously looking at your metrics give you a mental model for how your system behaviour changes in response to changes in configuration, whether intentional or not.