| > We've spent a lot of time building Grafana dashboards and they've been extremely helpful with debugging. It doesn't solve all problems but it certainly helps narrow down where to look. This is what I was about to write. Most of our services have 1 or 2 dashboards showing some service KPIs, for example HTTP request throughput and response time, plus interface metrics for other subsystems: queries to Postgres, messages to the message bus and so on. With dashboards like this, you can very quickly build a deduction chain of "The customer opened a ticket, well, because our 75th percentile response time went to 20 seconds, well, because our database response times spiked to an ungodly number". And then you can go to the dashboards about the database and quickly narrow it down to a subsystem of the database: is something consuming the entire CPU of the database, is the IO blocked, is the network there slow? In the happy cases, you can go from "The cluster is down" to "our database is blocked by a query" within a minute by looking at a few boards. That's very, very powerful and valuable. And sure, at that point, the dashboards aren't useful anymore. But a map doesn't lose value because you can now see your target. |
Or, to put it another way, looking at a dashboard should tell you reliable facts about the system that lead you to further exploration.
As the post puts it, "They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations."
Dashboards should not attempt to interpret anything without being very clear about how, why, and what they're doing.
Example: response time statistics vs. a "responsive green/red light".
If it's important enough to have logic built on top of it, that's an alert, and that's something different.
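To make that split concrete, here is a rough Python sketch, not anyone's actual setup. `dashboard_panel` only reports response time statistics, while `p75_alert` is the separate piece that carries the logic and states its rule explicitly. The function names, the nearest-rank percentile method, and the 2-second threshold are all assumptions made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that
    at least p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def dashboard_panel(response_times_s):
    """Dashboard side: plain facts about response times, no verdict."""
    return {
        "count": len(response_times_s),
        "p50_s": percentile(response_times_s, 50),
        "p75_s": percentile(response_times_s, 75),
        "max_s": max(response_times_s),
    }


# Hypothetical threshold, chosen only for this example.
P75_ALERT_THRESHOLD_S = 2.0


def p75_alert(response_times_s, threshold_s=P75_ALERT_THRESHOLD_S):
    """Alert side: the interpretation lives here, and the rule it
    applies is returned alongside the result so nobody has to guess."""
    p75 = percentile(response_times_s, 75)
    return {
        "firing": p75 > threshold_s,
        "rule": f"p75 > {threshold_s}s",
        "observed_s": p75,
    }


if __name__ == "__main__":
    samples = [0.1, 0.2, 0.3, 20.0]
    print(dashboard_panel(samples))
    print(p75_alert(samples))
```

With those four samples the panel still shows the 20-second outlier in `max_s` even though the alert stays quiet, which is roughly the point: the panel gives you the numbers to explore, and a green/red verdict only ever comes with its rule attached.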