by bbrazil 3430 days ago
> A histogram of the duration it took to serve a response to a request, also labelled by successes or errors.

I recommend against this. Instead, have one overall duration metric and another metric tracking a count of failures.

The reason for this is that very often just the success latency will end up being graphed, and high overall latency due to timing-out failed requests will be missed.
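To make that failure mode concrete, here is a minimal sketch in plain Python (no metrics client library). The request mix, the timeout value, and the function names are all made up for illustration. It compares what a success-only latency graph would show against one overall duration metric plus a separate failure count.

```python
from statistics import mean

# Hypothetical traffic: (duration_seconds, succeeded). Assumed numbers, not real data.
TIMEOUT = 10.0
requests = [(0.05, True)] * 90 + [(TIMEOUT, False)] * 10


def success_only_mean(reqs):
    """Mean latency as a dashboard graphing only successes would show it."""
    return mean(d for d, ok in reqs if ok)


def overall_mean(reqs):
    """Mean latency across every request, successful or not."""
    return mean(d for d, _ in reqs)


def failure_count(reqs):
    """Separate count of failed requests."""
    return sum(1 for _, ok in reqs if not ok)


print(f"success-only mean: {success_only_mean(requests):.3f}s")
print(f"overall mean:      {overall_mean(requests):.3f}s")
print(f"failures:          {failure_count(requests)} of {len(requests)}")
```

With these assumed numbers the success-only mean stays at 50ms while the overall mean is a little over one second, so a graph of successes alone would look healthy during the incident.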

The more information you put on a dashboard, the more chance someone will miss a subtlety like this in the interpretation. Particularly if debugging distributed systems isn't their forte, or they've been woken up in the middle of the night by a page.

This guide only covers what I'd consider online serving systems. For other types of systems, I'd suggest a look at the Prometheus instrumentation guidelines on what sort of things to monitor: https://prometheus.io/docs/practices/instrumentation/


By creating events that contain both the duration of the request and whether it succeeded, you can create graphs that show you the detail you need. Unless you include those data together at the beginning, it will be impossible to tease them apart later on. Combining them into one graph will likely conceal the difference in the two cases, as you describe, unless you feed them into a system that can natively tease them apart as easily as show them together (such as http://honeycomb.io). So it seems like the disagreement is more about visualization than collection (the section of the blog in which that quote appears).
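As a rough illustration of that point, the sketch below keeps duration and outcome together on each event, so the same data can be viewed either combined or split. It is plain Python; the `Event` type, the helper names, and the sample values are hypothetical and not taken from any particular product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical request event carrying both fields together."""
    duration_s: float
    succeeded: bool


events = [Event(0.04, True), Event(0.06, True), Event(10.0, False)]


def durations_by_outcome(evts):
    """Split view: durations grouped by success or failure."""
    groups = defaultdict(list)
    for e in evts:
        groups["success" if e.succeeded else "failure"].append(e.duration_s)
    return dict(groups)


def overall_durations(evts):
    """Combined view: the same events with the outcome ignored."""
    return [e.duration_s for e in evts]


split = durations_by_outcome(events)
combined = overall_durations(events)
print(split)
print(combined)
```

Going from events to either view is a simple aggregation; going from an already-combined list back to the split view is not possible, which is the "include those data together at the beginning" argument.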

The originally quoted advice, to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualization of that data makes clear the separation.

I absolutely agree that careful consideration is required when choosing what to put on dashboards to avoid confusion. That seems to be a separate issue.

(bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)

In practice, if your system is complicated and you have to look at the visualization, you are already in trouble. For anything complicated, you need exactly the inputs you describe, but everything has to be processed already by another layer that can give you higher-level ideas.

This is a place where I think you guys could beat what other 3rd party monitoring tools are doing. I work with some of your guest bloggers, and I work on a subsystem with its own dashboard: about 50 charts. To make onboarding new teammates a sensible experience, we need both a layer of alerts on top of the charts and a set of rules of thumb (which would be programmed in, if the alerting system were good enough) that put the alerts together into realistic failure cases: if X and Y triggered but Z didn't, then chances are this piece is the culprit.
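The "if X and Y triggered, but Z didn't" rule of thumb could be encoded along these lines. This is only a sketch: the alert names, the rule table, and the `diagnose` helper are invented for illustration and do not correspond to any particular alerting product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rule of thumb: these alerts firing, these not firing, implies a suspect."""
    must_fire: frozenset
    must_not_fire: frozenset
    suspect: str


# Hypothetical rules; real ones would come from a team's operational experience.
RULES = [
    Rule(frozenset({"api_latency_high", "db_latency_high"}),
         frozenset({"db_cpu_high"}),
         "network path between API and database"),
    Rule(frozenset({"api_latency_high", "db_cpu_high"}),
         frozenset(),
         "database under load"),
]


def diagnose(firing):
    """Return the suspect of every rule consistent with the set of firing alerts."""
    firing = set(firing)
    return [r.suspect for r in RULES
            if r.must_fire <= firing and not (r.must_not_fire & firing)]


print(diagnose({"api_latency_high", "db_latency_high"}))
```

Keeping the rules as data rather than as nested conditionals means a new teammate can read the table to learn the known failure cases.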

There are also opportunities in visualizations that aren't chart-based. We used to have something like that for another complex system at another employer, but that's expensive, custom work unless you join forces with something that understands where all your services are, knows all ingress and egress rules, and thus could automatically generate a picture of your system, along with understanding the instrumentation. So leave that until you merge with SkylinerHQ or something.

That said, I think you guys are heading towards a good, marketable product as it is. Fixing the annoying statsd/splunk divide of older monitoring would probably make us buy it already.

> Combining them into one graph will likely conceal the difference in the two cases, as you describe

Indeed. The first order issue is locating the problem though.

If you don't spot which of your microservices is the culprit due to only looking at successful latency, you're not going to get to the stage of comparing successful vs failed latency (and in practice, the increased error ratio combined with increased overall latency should tip you off).

> unless you feed them into a system that can natively tease them apart as easily as show them together

And the user actually thinks to perform that additional analysis.

> So it seems like the disagreement is more about visualization than collection

What I've seen happen is that the collection leads to the visualisation, which subsequently leads to prolonged outages due to misunderstanding.

Thus I suggest removing the risk of the issue on the visualisation end by eliminating the problem at the collection stage. This is particularly important when the people writing the collection code aren't the same people building the visualisations, and thus can't know whether everyone creating the dashboards will be sufficiently operationally sophisticated.

> to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualisation of that data makes clear the separation.

It's a little messier than that. Depending on exactly how the data is collected, such a split could make some analyses more difficult or impossible. For example, I need the overall latency increase in order to see if this server is entirely responsible for the overall latency increase I see one level up in the stack, or if there's some other or additional problem that needs explanation. There's no equivalent math for success or failure.

Put another way, the math on the overall works the way your intuition thinks it does. The split out version is more subtle.
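A small worked example of why the overall number composes the way intuition expects. It assumes, purely for illustration, that each frontend request makes exactly one backend call, and that each service exposes a sum of durations and a count of requests (as the `_sum` and `_count` series of a Prometheus histogram do). All numbers and names are made up.

```python
def mean_latency(total_seconds, request_count):
    """Overall mean latency from a duration sum and a request count."""
    return total_seconds / request_count


# Hypothetical (sum_seconds, count) pairs before and during an incident.
frontend_before, frontend_during = (60.0, 1000), (510.0, 1000)
backend_before, backend_during = (40.0, 1000), (490.0, 1000)

frontend_increase = mean_latency(*frontend_during) - mean_latency(*frontend_before)
backend_increase = mean_latency(*backend_during) - mean_latency(*backend_before)

# Whatever the backend does not account for needs another explanation.
unexplained = frontend_increase - backend_increase

print(f"frontend mean rose by {frontend_increase:.3f}s")
print(f"backend mean rose by  {backend_increase:.3f}s")
print(f"unexplained:          {unexplained:.3f}s")
```

Here the backend's overall increase accounts for the whole frontend increase. With success and failure latencies tracked separately, the same subtraction would also have to weight each class by how the success/failure mix shifted at each level, which is the more subtle math referred to above.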

> (bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.)

I work on Prometheus which is a metrics system. Honeycomb seems to be based on event logs. There's logic to removing the success/failure split for duration metrics as I suggest, but it'd be insanity to remove it for event logs. So in your case it is purely a visualisation problem, whereas for us losing granularity at the collection stage is an option (and sometimes required on cardinality grounds).
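On the cardinality point, a back-of-the-envelope sketch. It assumes a Prometheus-style histogram that exposes one series per bucket (including `+Inf`) plus `_sum` and `_count` for each label combination. The function name and the bucket, endpoint, and instance counts are made-up inputs.

```python
def histogram_series(buckets_including_inf, label_value_counts):
    """Estimate the number of time series for one histogram metric.

    label_value_counts: distinct values per label (hypothetical inputs).
    """
    combinations = 1
    for n in label_value_counts:
        combinations *= n
    # Each label combination gets its bucket series plus _sum and _count.
    return combinations * (buckets_including_inf + 2)


# Assumed example: 12 buckets, 20 endpoints, 50 instances.
without_outcome = histogram_series(12, [20, 50])
with_outcome = histogram_series(12, [20, 50, 2])  # extra success/failure label
print(without_outcome, with_outcome)
```

Under these assumptions, adding a two-valued outcome label doubles the series count for the metric, which is the kind of cost that can force dropping the split at collection time.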

The terminology the article uses (incrementing a counter at an instrumentation point) led me to believe we were discussing only metrics.

The way I would see things is that you'd use a metrics-based system like Prometheus to locate and understand the general problem and which subsystems are involved, and then start using log-based tools like Honeycomb as you dig further in to see which exact requests are at fault. They're complementary tools with different tradeoffs.

I've written about this in more depth at http://thenewstack.io/classes-container-monitoring/

> rather have one overall duration metric

What exactly would that be?

The duration metric as recommended by the article, but not broken out by success/failure.

yea that's not good advice, but "don't rely on overstuffed dashboards" is good advice.

you should only have simple dashboards (and alerting) for KPIs and end to end checks. Everything else should be instrumented and debugged using a real time sorting and slicing tool, esp if you have a complex system (microservices, distributed system, polyglot persistence).