by geofft 2303 days ago
Yes, I agree with this. But if you're relying on humans to look at dashboards to keep your actual service up in the moment, you're not seriously committed to automation (just like if you SSH to every machine you Terraform to tweak things, you're not really committed to Terraform).

What you should do is rely on automation to detect problems and alert people, and in postmortems, look at graphs and have humans say things like "Hey, this queue kept steadily climbing for three hours before the outage" or "We would have noticed it in this metric, but it's so noisy that we can't alert on it" or something. Then you can write more automation (or focus on some prerequisite dev work).
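
To make the "queue kept steadily climbing" case concrete, here is a rough sketch of the kind of check a postmortem like that might turn into. Everything in it is an assumption for illustration: the function name `climbing_steadily`, the thresholds, and the idea that you have queue depths sampled at a fixed interval. A real alerting system would express this in its own rule language.

```python
from statistics import mean


def climbing_steadily(samples, min_slope=0.0, min_fraction_rising=0.8):
    """Return True if the series trends upward across the whole window.

    samples: queue depths taken at a fixed interval, oldest first.
    min_slope: least-squares slope (depth per sample) that must be exceeded.
    min_fraction_rising: share of consecutive steps that must not decrease.
    All names and defaults are hypothetical, not from any real tool.
    """
    n = len(samples)
    if n < 3:
        return False  # too little data to call anything a trend

    xs = range(n)
    x_bar, y_bar = mean(xs), mean(samples)

    # Ordinary least-squares slope of depth against sample index.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator

    # Count steps where the queue did not shrink, to reject noisy series
    # that happen to have a positive overall slope.
    rising = sum(1 for a, b in zip(samples, samples[1:]) if b >= a)

    return slope > min_slope and rising / (n - 1) >= min_fraction_rising
```

The second condition is there for the "so noisy that we can't alert on it" problem: a metric that bounces around fails the check even if its overall slope is positive, which trades some missed detections for fewer false pages.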

1 comment

I don't think anyone is arguing that, though. Lots of things humans notice, e.g. "we speculatively upped the virtual file system cache and now the service has worse throughput but better high nines response time", are not things you can really build an alert for, nor things you really want an alert for, but they would absolutely show up on a dashboard you're intimate with.

In other words, people are not arguing for replacing alerts with humans, but rather arguing that continuously looking at your metrics gives you a mental model of how your system's behaviour changes in response to changes in configuration, whether intentional or not.