by bbrazil
3430 days ago
> A histogram of the duration it took to serve a response to a request, also labelled by successes or errors.

I recommend against this; rather, have one overall duration metric and another metric tracking a count of failures. The reason is that very often only the success latency ends up being graphed, and high overall latency due to failed requests timing out will be missed. The more information you put on a dashboard, the greater the chance someone will miss a subtlety like this in the interpretation, particularly if debugging distributed systems isn't their forte or they've been woken up in the middle of the night by a page.

This guide only covers what I'd consider online serving systems. For other types of systems, I'd suggest a look at the Prometheus instrumentation guidelines on what sort of things to monitor: https://prometheus.io/docs/practices/instrumentation/
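
To make the point about success-only latency concrete, here is a minimal sketch in plain Python. It does not use any Prometheus client. The sample data, the 30 s timeout, and the `p95` helper are all assumptions made up for illustration. It compares the 95th percentile of successful requests with the 95th percentile of all requests, alongside a separate failure count.

```python
# Hypothetical sample: (duration_seconds, succeeded) per request.
# 90 fast successes plus 10 failures that hit an assumed 30 s timeout.
requests = [(0.1, True)] * 90 + [(30.0, False)] * 10


def p95(durations):
    """Return the 95th percentile using the nearest-rank method."""
    ordered = sorted(durations)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# What a dashboard shows if only the "success" series is graphed.
success_only = [d for d, ok in requests if ok]

# One overall duration metric, regardless of outcome.
overall = [d for d, _ in requests]

# A separate metric that only counts failures.
failure_count = sum(1 for _, ok in requests if not ok)

print("p95 success-only:", p95(success_only))
print("p95 overall:", p95(overall))
print("failures:", failure_count)
```

In this made-up sample the success-only percentile looks healthy while the overall percentile sits at the timeout, which is the subtlety a success-only graph would hide.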
The originally quoted advice, to show "the duration it took to serve a response to a request, also labelled by successes or errors", remains good advice, so long as the visualization of that data makes the separation clear.
I absolutely agree that careful consideration is required when choosing what to put on dashboards to avoid confusion. That seems to be a separate issue.
(bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)