Hacker News
by SkipperCat 1763 days ago
Dashboards are invaluable. Humans can take in a lot of data from images, and there is no better way to grok data than a graph.

We've spent a lot of time building Grafana dashboards and they've been extremely helpful with debugging. It doesn't solve all problems but it certainly helps narrow down where to look.

Sure, we still look at log files, use htop and a lot of other tools, but our first stop is always Grafana.

I suggest almost any book by Edward Tufte. There you'll see the beauty and value of visual information.

4 comments

> We've spent a lot of time building Grafana dashboards and they've been extremely helpful with debugging. It doesn't solve all problems but it certainly helps narrow down where to look.

This is what I was about to write. Most of our services have 1 or 2 dashboards showing some service KPIs - for example HTTP request throughput and response time, and also interface metrics to other subsystems - queries to postgres, messages to the message bus and so on.

With dashboards like this, you can very quickly build a deduction chain of "The customer opened a ticket, well because our 75%ile of the response time went to 20 seconds, well, because our database response times spiked to an ungodly number".
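For anyone wondering what sits behind a "75%ile" panel, here is a minimal sketch using the nearest-rank method. The function name and the sample response times are made up for illustration, and real monitoring backends usually estimate percentiles from histogram buckets rather than from raw samples.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at
    least pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    # rank is 1-based; clamp so that pct=0 still returns the minimum.
    return ordered[max(rank, 1) - 1]

# Hypothetical response times (seconds) for one dashboard interval.
response_times = [0.2, 0.3, 0.25, 0.4, 18.0, 22.0, 0.35, 21.0]

p50 = percentile(response_times, 50)
p75 = percentile(response_times, 75)
print(f"p50={p50}s p75={p75}s")
```

With this data the median still looks healthy while the 75th percentile is already in "customer opens a ticket" territory, which is why a panel that shows only an average or median can hide the problem.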

And then you can go to the dashboards about the database, and quickly narrow it down to the subsystem of the database - is something consuming the entire CPU of the database, is the IO blocked, is the network there slow.

In the happy cases, you can go from "The cluster is down" to "our database is blocked by a query" within a minute by looking at a few boards. That's very, very powerful and valuable.

And sure, at that point, the dashboards aren't useful anymore. But a map doesn't lose value because you can now see your target.

This is where dashboards have been most useful to me: extract and expose key metrics of the system, in as unbiased and raw a form as possible.

Or, to put it another way, looking at a dashboard should tell you reliable facts about the system, that lead you to further exploration.

As the post puts it, "They’re great for getting a high level sense of system functioning, and tracking important stats over long intervals. They are a good starting point for investigations."

Dashboards should not attempt to interpret anything, without being very clear about how, why, and what they're doing.

Example: response time statistics vs "responsive green/red light"

If it's important enough to have logic built on top of it, that's an alert, and that's something different.
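A rough sketch of that split, with hypothetical names and thresholds (nothing here comes from a real monitoring tool): the panel function only reports numbers, while the alert function carries the interpretation and states its parameters explicitly.

```python
def panel_stats(samples):
    """Dashboard side: raw facts about response times, no judgment."""
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
    }

def should_alert(samples, threshold_s=5.0, min_breaches=3):
    """Alert side: explicit logic layered on the same data.
    threshold_s and min_breaches are assumed values for illustration."""
    breaches = sum(1 for s in samples if s > threshold_s)
    return breaches >= min_breaches

# Hypothetical response times in seconds.
samples = [0.2, 0.3, 6.0, 7.5, 9.0]

print(panel_stats(samples))
print(should_alert(samples))
```

A green/red light is basically `should_alert` drawn on a dashboard with the parameters hidden, which is the part I object to.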

A dashboard in a car gives you largely active feedback to help you drive. Mechanics' diagnostic tools give them a deep dive into information about repairing the vehicle, not limited to real time or to just driving activity. Those two things are very different.

I don't think a debugging "dashboard" is a dashboard. That conflates two very different ideas. We need a different name for that.

What about the "check engine" light or the "low oil" light? Both are debugging indicators about the health of your engine.

But snark aside, I do agree with you about the intent of a dashboard. I've always told people dashboards are there to give you historical and real-time info about the performance and rates of your systems. They don't show you exactly what's wrong, but they are very helpful in showing where to start looking.

I don't think dashboards help with debugging at all, and they shouldn't be designed with that goal in mind.

Dashboards tell you that a problem exists. Just demonstrating that SOMETHING is broken means the dashboard has accomplished its task in full. The engineer's job is then to fix it.

> We've spent a lot of time building Grafana dashboards and they've been extremely helpful with debugging. It doesn't solve all problems but it certainly helps narrow down where to look.

And then once the bugs that led to the creation of that dashboard are fixed or retired, what's left for that data? It just sits there with its pretty graphs and eye-catching visualizations to snare the unwary who are looking for help debugging a different problem. In fact, they'd be best served by ignoring existing dashboards and creating new ones specific to the issue in the present, not some dead husk of a problem that looks like it might be related.

How do things like message queue sizes, transactions per second, errors per minute, memory usage etc become less useful after solving one bug?

What if the next bug has nothing to do with the message queue? What if your last fix to improve the TPS ends up being sufficient headroom that the next problem never triggers a high rate? What if your errors per minute is only aggregating errors for the services you last had high error rates on?

The point: all those measures and the dashboard created to monitor them were likely put in place because of whatever prior bug or outage was traced to not knowing those metrics. But the next problem might be something else, for which the metrics are not collected, aggregated, or displayed. Now you've got a dashboard with lots of information, but it's not showing any problems, and it's not providing any insights into why your customers are complaining and all the product people are in fire drill mode.

This is like advocating to get rid of log statements because they might not log the cause of the next bug.

No, it's like advocating to get rid of log statements because they are like an overflowing sewer of data obscuring and distracting from the real cause of the bug. But you can't even figure out if there's a message in the log because it takes an hour to sort through the firehose of messages you know aren't relevant, but there they are, filling up your logs.

The above goes double if you're running Java and your logs include stacktraces spanning 300 lines.