by cratermoon 1763 days ago
> We've spent a lot of time building Grafana dashboards and they've been extremely helpful with debugging. It doesn't solve all problems but it certainly helps narrow down where to look.

And then once the bugs that led to the creation of that dashboard are fixed or retired, what's left for that data? It just sits there with its pretty graphs and eye-catching visualizations to snare the unwary who are looking for help debugging a different problem. In fact, they'd be best served by ignoring existing dashboards and creating new ones specific to the issue in the present, not some dead husk of a problem that looks like it might be related.

2 comments

How do things like message queue sizes, transactions per second, errors per minute, memory usage, etc. become less useful after solving one bug?
What if the next bug has nothing to do with the message queue? What if your last fix to improve the TPS leaves enough headroom that the next problem never triggers a high rate? What if your errors-per-minute figure only aggregates errors for the services that had high error rates last time?

The point: all those measures and the dashboard created to monitor them were likely put in place because of whatever prior bug or outage was traced to not knowing those metrics. But the next problem might be something else, for which the metrics are not collected, aggregated, or displayed. Now you've got a dashboard with lots of information, but it's not showing any problems, and it's not providing any insights into why your customers are complaining and all the product people are in fire drill mode.
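
To make the errors-per-minute case concrete, here's a minimal sketch. The service names, the watch list, and the counts are all made up for illustration; the only point is that an aggregate built around the last outage can read as healthy while a different service is failing.

```python
from collections import Counter

# Hypothetical: the services the dashboard was wired up to watch
# after the previous outage.
WATCHED = {"checkout", "payments"}

def dashboard_errors(events):
    """Error counts as the dashboard sees them: only watched services."""
    return Counter(svc for svc in events if svc in WATCHED)

def all_errors(events):
    """Error counts across every service that actually emitted an error."""
    return Counter(events)

# One minute of (made-up) error events, one entry per error.
events = ["checkout"] + ["search"] * 40

shown = dashboard_errors(events)
actual = all_errors(events)
print("dashboard total:", sum(shown.values()))
print("actual total:", sum(actual.values()))
```

The dashboard total stays tiny because "search" was never on the watch list, which is exactly the situation where the graphs look fine and the customers are still complaining.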

This is like advocating to get rid of log statements because they might not log the cause of the next bug.
No, it's like advocating to get rid of log statements because they are an overflowing sewer of data that obscures and distracts from the real cause of the bug. You can't even tell whether there's a relevant message in the log, because it takes an hour to sort through the firehose of messages you know aren't relevant, but there they are, filling up your logs.

The above goes double if you're running Java and your logs include stack traces spanning 300 lines.