
> Combining them into one graph will likely conceal the difference in the two cases, as you describe

Indeed. The first-order issue is locating the problem, though.

If you don't spot which of your microservices is the culprit due to only looking at successful latency, you're not going to get to the stage of comparing successful vs failed latency (and in practice, the increased error ratio combined with increased overall latency should tip you off).

> unless you feed them in to a system that can natively tease them apart as easily as show them together

And the user actually thinks to perform that additional analysis.

> So it seems like the disagreement is more about visualization than collection

What I've seen happen is that the collection leads to the visualisation, which subsequently leads to prolonged outages due to misunderstanding.

Thus I suggest removing the risk of the issue on the visualisation end by eliminating the problem at the collection stage. This is particularly important when the people writing the collection code aren't the same people building the dashboards, and so can't know whether everyone creating dashboards will be sufficiently operationally sophisticated.

> to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualisation of that data makes clear the separation.

It's a little messier than that. Depending on exactly how the data is collected, such a split could make some analyses more difficult or impossible. For example, I need this server's overall latency increase in order to see whether it is entirely responsible for the overall latency increase I see one level up in the stack, or whether there's some other or additional problem that needs explanation. There's no equivalent math for the success-only or failure-only latencies.

Put another way, the math on the overall latency works the way your intuition thinks it does. The split-out version is more subtle.
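To make that concrete, here's a rough sketch with made-up numbers (the counts, the latencies, and the function name are all hypothetical, not from any real system). It assumes you have per-outcome request counts and mean latencies, and recombines them into the overall mean. Neither per-outcome mean changes at all, yet the overall mean nearly doubles, purely because the error ratio moved:

```python
def overall_mean_latency(success_count, success_mean, failure_count, failure_mean):
    """Mean latency over all requests, recombined from per-outcome counts and means."""
    total_time = success_count * success_mean + failure_count * failure_mean
    return total_time / (success_count + failure_count)


# Hypothetical numbers: successes average 0.1s, failures (say, timeouts) average 1.0s.
# Only the mix of outcomes changes between the two windows.
before = overall_mean_latency(990, 0.1, 10, 1.0)
after = overall_mean_latency(900, 0.1, 100, 1.0)

print(f"before: {before:.3f}s, after: {after:.3f}s")
```

Someone looking only at the two split graphs would see flat lines, while the caller one level up sees latency go from roughly 0.11s to 0.19s. The overall number is the one that lines up with what the caller experiences.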

> (bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)

I work on Prometheus, which is a metrics system. Honeycomb seems to be based on event logs. There's logic to removing the success/failure split for duration metrics as I suggest, but it'd be insanity to remove it for event logs. So in your case it is purely a visualisation problem, whereas for us losing granularity at the collection stage is an option (and sometimes required on cardinality grounds).
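On the cardinality point, a back-of-the-envelope sketch may help. It assumes a Prometheus-style histogram, where each label combination exposes one series per configured bucket plus the `+Inf` bucket, `_sum`, and `_count`. The label names and counts are invented for illustration, and `histogram_series` is a hypothetical helper, not part of any library:

```python
from math import prod


def histogram_series(bucket_count, label_cardinalities):
    """Rough count of time series one histogram metric exposes.

    bucket_count: configured buckets, not counting +Inf.
    label_cardinalities: number of distinct values for each label.
    """
    # Per label combination: configured buckets, plus +Inf, _sum and _count.
    per_combination = bucket_count + 3
    return per_combination * prod(label_cardinalities)


# Hypothetical service: 10 buckets, 20 handlers, 50 instances.
without_outcome = histogram_series(10, [20, 50])
# Same metric with an extra two-valued outcome label (success/failure).
with_outcome = histogram_series(10, [20, 50, 2])

print(without_outcome, with_outcome)
```

Every extra label multiplies the series count, so a two-valued outcome label doubles the cost of what is often already the most expensive metric a service exposes.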

The terminology the article uses (incrementing a counter at an instrumentation point) led me to believe we were discussing only metrics.

The way I see things, you'd use a metrics-based system like Prometheus to locate and understand the general problem and which subsystems are involved, and then start using log-based tools like Honeycomb as you dig further in to see which exact requests are at fault. They're complementary tools with different tradeoffs.

I've written about this in more depth at http://thenewstack.io/classes-container-monitoring/