by wjossey 2901 days ago
One thing I'd encourage is building a culture of looking at graphs daily as a matter of practice. Alerts are only as effective as the knowledge of the writer, and human-based detection is almost always the first step to better alerting.

By looking at graphs on a daily basis, you get much faster at diagnosing issues and at understanding the root cause of the underlying problem (assuming you're measuring the thing that broke). So many times I was able to look at a graph and go, "Something isn't right," long before an alert would go off. I wasn't magical; I had just spent a little bit of time every day reviewing graphs, so I could pick out subtle changes across a handful of graphs very quickly. Those changes would also have been hard to build checks around pre-emptively.

One additional thought is to not over-alert. Alert fatigue is your greatest enemy, and needs to be avoided at all costs. If you're not reacting to an alert, you should question whether or not you need it.
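To make the "if you're not reacting to it, question it" test concrete, here is a minimal sketch. It assumes you can export a log of fired alerts as (alert name, whether anyone acted on it) pairs; that format, the function name `unactioned_alerts`, and both thresholds are made up for illustration and don't come from any particular tool.

```python
from collections import defaultdict


def unactioned_alerts(events, min_fires=5, max_action_rate=0.2):
    """Flag alerts that fire often but are rarely acted on.

    events: iterable of (alert_name, was_acted_on) tuples (hypothetical format).
    Returns (alert_name, fire_count, action_rate) tuples, noisiest first.
    """
    fires = defaultdict(int)
    acted = defaultdict(int)
    for name, was_acted_on in events:
        fires[name] += 1
        if was_acted_on:
            acted[name] += 1

    flagged = []
    for name, count in fires.items():
        rate = acted[name] / count
        # Only flag alerts with enough history to judge and a low action rate.
        if count >= min_fires and rate <= max_action_rate:
            flagged.append((name, count, rate))
    return sorted(flagged, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    log = [("cpu_high", False)] * 9 + [("cpu_high", True)] + [("disk_full", True)] * 3
    for name, count, rate in unactioned_alerts(log):
        print(f"{name}: fired {count} times, acted on {rate:.0%}")
```

Anything this flags is a candidate for review, not automatic deletion: a rarely actioned alert may just need a better threshold.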

2 comments

> Alerts are only as effective as the knowledge of the writer

We're enforcing that every alert links to a playbook with more details about the issue and how to mitigate it.
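A check like this could run in CI. This is only a rough sketch, not our actual tooling: it assumes alert definitions can be loaded as dicts, and the `name` and `playbook_url` fields and the `missing_playbooks` helper are hypothetical.

```python
from urllib.parse import urlparse


def missing_playbooks(alerts):
    """Return the names of alerts that lack a usable http(s) playbook URL.

    alerts: list of dicts with hypothetical "name" and "playbook_url" keys.
    """
    missing = []
    for alert in alerts:
        url = urlparse(alert.get("playbook_url") or "")
        # Require an explicit http(s) scheme and a host; anything else is rejected.
        if url.scheme not in ("http", "https") or not url.netloc:
            missing.append(alert.get("name", "<unnamed>"))
    return missing


if __name__ == "__main__":
    definitions = [
        {"name": "disk_full", "playbook_url": "https://wiki.example.com/playbooks/disk"},
        {"name": "queue_depth", "playbook_url": ""},
    ]
    bad = missing_playbooks(definitions)
    if bad:
        raise SystemExit("alerts without a playbook link: " + ", ".join(bad))
```

It only checks that a link is present and well formed, not that the playbook behind it is any good.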

On over-alerting: we were already over-alerting before we started the task force, and one of our efforts is removing CloudWatch and Riemann alerts that are usually noisy and offer little insight into the real issues.

Any suggestions for getting graphs on a TV in the office? I usually have monitoring tools like New Relic open in my browser, but I'd like to have these aggregated on a wall-mounted TV for the team to look at together.