
I tend to find that Grafana dashboards solve a different problem from Prometheus's alerts, and getting the balance right between the two really helps.

I'll use alerts for things that need acting on in the immediate to short term (e.g. a server's load being very high for over 10 minutes, or an SSL certificate that will need renewing in the next couple of weeks).
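To make those two examples concrete, here's a rough Python sketch of the conditions involved. It isn't how Prometheus evaluates rules (though the "sustained for a duration" check is loosely similar in spirit to the `for` clause on an alerting rule); the function names, the threshold of 8.0, and the 14-day window are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def load_high_for(samples, threshold, duration):
    """Return True if the load has stayed above `threshold` for at least
    `duration`, counting back from the newest sample.

    `samples` is a list of (timestamp, load) tuples sorted by timestamp.
    (Hypothetical helper, not part of any Prometheus client library.)
    """
    if not samples:
        return False
    window_start = samples[-1][0] - duration
    breach_start = None
    # Walk backwards from the newest sample while the load stays high.
    for ts, load in reversed(samples):
        if load <= threshold:
            break
        breach_start = ts
    return breach_start is not None and breach_start <= window_start


def cert_needs_renewal(not_after, now, window=timedelta(days=14)):
    """Return True if the certificate expires within `window` of `now`.

    The 14-day default stands in for "the next couple of weeks".
    """
    return not_after - now <= window


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # One sample per minute for the last 12 minutes, all with high load.
    samples = [(now - timedelta(minutes=m), 9.0) for m in range(12, -1, -1)]
    print(load_high_for(samples, threshold=8.0, duration=timedelta(minutes=10)))
    print(cert_needs_renewal(now + timedelta(days=10), now))
```

The point of the duration check is that a brief spike doesn't page anyone; only a sustained problem does, which keeps alerts limited to things that actually need acting on.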

And I'll use Grafana dashboards to provide insight into how a system is running (e.g. graphing the number of events triggered over time, or a heat map of web servers' response times).

As a team we've got a handful of Grafana dashboards that we look at after we've walked our Kanban board. It only takes a few minutes, and we all get to know what the current state of our systems is and, over time, what their norm is as well.