Hacker News new | ask | show | jobs
by twic 2152 days ago
> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming

Yeah nah, but, okay, nah yeah.

Generating metrics in the app is much more intrusive, and requires that you figure out the metrics you need ahead of time. It adds dependencies, sockets, and threads to your app.

Unless you're very careful, it's also easy to end up double-aggregating, computing medians of medians and other meaningless pseudo-statistics - if you're using the Dropwizard Metrics library, for example, you've already lost.

If you output structured log events, where everything is JSON or whatever and there are common schema elements, you can easily pull out the metrics you need, configure new ones on the fly, and retrospectively calculate them if you keep a window of log history.

When i've worked on systems with both pre- and post-calculated metrics, the post-calculated metrics were vastly more useful.

The huge, virtually showstopping, caveat here is that there is lots of decent, easy-to-use tooling for pre-calculated metrics, and next to nothing for post-calculated metrics. You can drop in some libraries and stand up a couple of servers and have traditional metrics going in a day, with time for a few games of table tennis. You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.

Anyway if there's a VC reading this with twenty million quid burning a hole in their pocket who isn't fussy about investing in companies with absolutely no path to profitability, let me know, and i'll do a startup to fix all this. I'll even put the metrics on the blockchain for you, guaranteed street cred.

4 comments

> Unless you're very careful, it's also easy to end up double-aggregating,

Oh no, never do anything fancy on the client end. yeah thats total trash. Any client that does any kind of aggregating is a massive pain in the arse.

Counters are good enough for 90% of everything you want. You can turn counters into hits per second easily. Plus they are more resistant to time based averaging. If you do your stats correctly, you can even has resetting counters create nice smooth graphs (non negative derivatives are a god send)

> Dropwizard

Yes, this is a library that argues strongly against the use of metrics. From what I recall 1 node of casasndra will output close to 50,000 metrics by default. That is too much.

When a team I worked with were migrating away from splunk to graphite/grafana, they shat out something close to a million metrics. 99.8% were totally useless.

> You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.

Yes! I think thats my main objection. Its so bloody expensive to do post-hock metrics. you can buy in splunk, but thats horrifically expensive. Or you can use an open source version and loose 4 person years before you even get a graph.

> if you're using the Dropwizard Metrics library, for example, you've already lost.

Can you go into a bit more detail here? Curious to know where Dropwizzard goes wrong.

I prefer to use the Prometheus client libraries where possible. Prometheus' data model is "richer" -- metric families and labels, rather than just named metrics. Adapting from Dropwizzard to Prometheus is a pain, and never results in the data feeling "native" to Prometheus.

I think they just mean the host is aggregating, so any further aggregation is compounded slant time the data. Like StatsD’s default is shipping metrics every 10s, so if you graph it and your graph rolls up those data points into 10 minute data points (cuz you’re viewing a week at once), then you’re averaging an average. Or averaging a p95. People often miss that this is happening, and it can drastically change the narrative.
Yes, exactly this. It's the fact that you're doing aggregation in two places. Since you're always going to be aggregating on the backend, aggregating in the app is bad news.

It may be interesting to think about the class of aggregate metrics that you can safely aggregate. Totals can be summed. Counts can be summed. Maxima can be maxed. Minima can be minned. Histograms can be summed (but histograms are lossy). A pair of aggregatable metrics can be aggregated pairwise; a pair of a total and a count lets you find an average.

Medians and quantiles, though, can't be combined, and those are what we want most of the time.

Someone who loves functional programming can tell us if metrics in this class are monoids or what.

There is an unjustly obscure beast called a t-digest which is a bit like an adaptive histogram; it provides a way to aggregate numbers such that you can extract medians and quantiles, and the aggregates can be combined:

https://github.com/tdunning/t-digest

The problem is post calculating is so slow. At least from my naive viewpoint. I can load dozens of graphs in datadog in seconds, can change tags or time frame and takes literally a second to load. Our Splunk dashboards can take over a minute to load, and reload for any change is more waiting.
Splunk taking minutes to load dashboard is not a problem imposed by post-calculation, it's more of a problem of lack of schema.

Most post calculation works on free text logs and thus has to regex it's way to a solution. But it doesn't have to be that way; that's why the original poster talked about a lack of tooling in the post-calculation world

You're optimizing for the wrong thing. The hard part about this space isn't extracting value from data, it's physically shipping the data through the infrastructure and into the relevant systems. Metrics are so great compared to logs precisely because they're precalculated (read: highly compressed) before leaving the originating service.
That does sound like you are assuming that everyone always talks about large-scale environments with dozens or more machines?
Everything we're talking about here is trivial until you reach a minimum scale. "Dozens of machines" is still trivial, to be honest.
Not for many people running such environments. So you either see with very minimal setups without tooling to help (which is survivable at small scales, but inefficient and mindnumbing), or towers of complexity that followed widely shared advice by people always assuming massive scale.