|
|
|
|
|
by maplebed
3434 days ago
|
|
By creating events that contain both the duration of the request and whether it succeeded, you can create graphs that show you the detail you need. Unless you include those data together at the beginning, it will be impossible to tease them apart later on. Combining them into one graph will likely conceal the difference in the two cases, as you describe, unless you feed them in to a system that an natively tease them apart as easily as show them together (such as http://honeycomb.io). So it seems like the disagreement is more about visualization than collection (the section of the blog in which that quote appears). The originally quoted advice, to show "the duration it took to serve a response to a request, also labelled by successes or errors" remains good advice, so long as the visualization of that data makes clear the separation. I absolutely agree that careful consideration is required when choosing what to put on dashboards to avoid confusion. That seems to be a separate issue. (bias alert - I work on Honeycomb, and care deeply about collecting data in a way that lets you pull away the irrelevant data to illuminate the real problems.) |
|
This is a place where I think you guys could beat what other 3rd party monitoring tools are doing. I work with some of your guest bloggers, and I work on a subsystem with its own dashboard: about 50 charts. To make bringing new teammates a sensible experience, we need both a layer of alerts on top of the charts, and then a set of rules of thumb, that should be programmed if the alerting system was good enough, that put the alerts together into realistic failure cases: if X and Y triggered, but Z didn't, then chances are this piece is probably the culprit.
There's also opportunities in visualizations that aren't chart based: We used to have something like that for another complex system in another employer, but that's expensive, custom work, unless you join forces with something that understands were all your services are, knows all ingress and egress rules, and thus could automatically generate a picture of your system, along with understanding the instrumentation: So leave that until you merge with SkylinerHQ or something.
That said, I think you guys are heading towards a good, marketable product as it is. Fixing the annoying the statsd/splunk divide of older monitoring would probably make us buy it already.