by sgsag33 1288 days ago
In my experience, over time dev teams stop looking at the metrics on a board, and all we really need are proper monitors/alarms that cover all relevant cases of our domain and write a Slack message or send an email... KISS!
2 comments

I tend to find that Grafana dashboards solve a different problem from Prometheus's alerts, and getting the balance right between the two really helps.

I'll use alerts for things that need acting on in the immediate to short term (e.g. a server's load being very high for over 10 minutes, or an SSL certificate that will need renewing in the next couple of weeks).
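
To make those two conditions concrete, here's a rough Python sketch of the logic. The function names, the load threshold of 4.0 and the 14-day window are all made up for illustration. In practice you'd more likely express the first one as a Prometheus alerting rule with a `for: 10m` clause than write it by hand.

```python
from datetime import datetime, timedelta, timezone


def sustained_high(samples, threshold, window):
    """Return True if the value has stayed above `threshold` for at least `window`.

    `samples` is a list of (timestamp, value) pairs sorted oldest to newest.
    (Hypothetical helper, not part of any monitoring library.)
    """
    if not samples:
        return False
    latest = samples[-1][0]
    start = None
    # Walk backwards from the newest sample until the value drops to the threshold or below.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        start = ts
    return start is not None and latest - start >= window


def cert_expiring_soon(not_after, now, within=timedelta(days=14)):
    """Return True if the certificate expires within `within` of `now` (assumed window)."""
    return not_after - now <= within


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# One sample per minute for the last 12 minutes, all well above the threshold.
high = [(now - timedelta(minutes=m), 9.0) for m in range(12, -1, -1)]

# Same series, but the load dipped 5 minutes ago, so the "high" streak is too short.
spike = [(ts, 1.0 if ts == now - timedelta(minutes=5) else v) for ts, v in high]

print(sustained_high(high, 4.0, timedelta(minutes=10)))   # True
print(sustained_high(spike, 4.0, timedelta(minutes=10)))  # False
print(cert_expiring_soon(now + timedelta(days=10), now))  # True
```

The point of the "for over 10 minutes" part is that a brief spike doesn't page anyone; only a sustained problem does.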

And I'll use Grafana dashboards to provide insight into how a system is running (e.g. graphing the number of events triggered over time, or a heat map of web servers' response times).

As a team we've got a handful of Grafana dashboards that we look at after we've walked our Kanban board. It only takes a few minutes and we all get to know what the current state of our systems is and, over time, what their norm is as well.

Ability to see trends in prod data and fix issues before they appear => Prom/Grafana

Ability to detect when something is broken right now => Monit, Nagios, etc.

Ideally, you have both.