Hacker News new | ask | show | jobs
by KaiserPro 2152 days ago
A few things I have learnt along the way:

Logs are great, but only once you've identified the problem. If you are searching through logs to _find_ a problem, its far too late.

Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming (example: dont use access logs to collect 4xx/5xx and make a graph, collate and push the metrics directly)

Raw metrics are pretty useless. They need to be manipulated into buisness goals: service x is producing 3% 5xx errors vs % of visitors unable to perform action x

Alerts must be actionable.

Alerts rules must be based on sensible clear cut rules: service x's response time is breeching its SLA not service x's response time is double its average for this time in may.

10 comments

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming

Yeah nah, but, okay, nah yeah.

Generating metrics in the app is much more intrusive, and requires that you figure out the metrics you need ahead of time. It adds dependencies, sockets, and threads to your app.

Unless you're very careful, it's also easy to end up double-aggregating, computing medians of medians and other meaningless pseudo-statistics - if you're using the Dropwizard Metrics library, for example, you've already lost.

If you output structured log events, where everything is JSON or whatever and there are common schema elements, you can easily pull out the metrics you need, configure new ones on the fly, and retrospectively calculate them if you keep a window of log history.

When i've worked on systems with both pre- and post-calculated metrics, the post-calculated metrics were vastly more useful.

The huge, virtually showstopping, caveat here is that there is lots of decent, easy-to-use tooling for pre-calculated metrics, and next to nothing for post-calculated metrics. You can drop in some libraries and stand up a couple of servers and have traditional metrics going in a day, with time for a few games of table tennis. You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.

Anyway if there's a VC reading this with twenty million quid burning a hole in their pocket who isn't fussy about investing in companies with absolutely no path to profitability, let me know, and i'll do a startup to fix all this. I'll even put the metrics on the blockchain for you, guaranteed street cred.

> Unless you're very careful, it's also easy to end up double-aggregating,

Oh no, never do anything fancy on the client end. yeah thats total trash. Any client that does any kind of aggregating is a massive pain in the arse.

Counters are good enough for 90% of everything you want. You can turn counters into hits per second easily. Plus they are more resistant to time based averaging. If you do your stats correctly, you can even has resetting counters create nice smooth graphs (non negative derivatives are a god send)

> Dropwizard

Yes, this is a library that argues strongly against the use of metrics. From what I recall 1 node of casasndra will output close to 50,000 metrics by default. That is too much.

When a team I worked with were migrating away from splunk to graphite/grafana, they shat out something close to a million metrics. 99.8% were totally useless.

> You need to build and bodge a terrifying pile of stuff to get post-calculated metrics going.

Yes! I think thats my main objection. Its so bloody expensive to do post-hock metrics. you can buy in splunk, but thats horrifically expensive. Or you can use an open source version and loose 4 person years before you even get a graph.

> if you're using the Dropwizard Metrics library, for example, you've already lost.

Can you go into a bit more detail here? Curious to know where Dropwizzard goes wrong.

I prefer to use the Prometheus client libraries where possible. Prometheus' data model is "richer" -- metric families and labels, rather than just named metrics. Adapting from Dropwizzard to Prometheus is a pain, and never results in the data feeling "native" to Prometheus.

I think they just mean the host is aggregating, so any further aggregation is compounded slant time the data. Like StatsD’s default is shipping metrics every 10s, so if you graph it and your graph rolls up those data points into 10 minute data points (cuz you’re viewing a week at once), then you’re averaging an average. Or averaging a p95. People often miss that this is happening, and it can drastically change the narrative.
Yes, exactly this. It's the fact that you're doing aggregation in two places. Since you're always going to be aggregating on the backend, aggregating in the app is bad news.

It may be interesting to think about the class of aggregate metrics that you can safely aggregate. Totals can be summed. Counts can be summed. Maxima can be maxed. Minima can be minned. Histograms can be summed (but histograms are lossy). A pair of aggregatable metrics can be aggregated pairwise; a pair of a total and a count lets you find an average.

Medians and quantiles, though, can't be combined, and those are what we want most of the time.

Someone who loves functional programming can tell us if metrics in this class are monoids or what.

There is an unjustly obscure beast called a t-digest which is a bit like an adaptive histogram; it provides a way to aggregate numbers such that you can extract medians and quantiles, and the aggregates can be combined:

https://github.com/tdunning/t-digest

The problem is post calculating is so slow. At least from my naive viewpoint. I can load dozens of graphs in datadog in seconds, can change tags or time frame and takes literally a second to load. Our Splunk dashboards can take over a minute to load, and reload for any change is more waiting.
Splunk taking minutes to load dashboard is not a problem imposed by post-calculation, it's more of a problem of lack of schema.

Most post calculation works on free text logs and thus has to regex it's way to a solution. But it doesn't have to be that way; that's why the original poster talked about a lack of tooling in the post-calculation world

You're optimizing for the wrong thing. The hard part about this space isn't extracting value from data, it's physically shipping the data through the infrastructure and into the relevant systems. Metrics are so great compared to logs precisely because they're precalculated (read: highly compressed) before leaving the originating service.
That does sound like you are assuming that everyone always talks about large-scale environments with dozens or more machines?
Everything we're talking about here is trivial until you reach a minimum scale. "Dozens of machines" is still trivial, to be honest.
Not for many people running such environments. So you either see with very minimal setups without tooling to help (which is survivable at small scales, but inefficient and mindnumbing), or towers of complexity that followed widely shared advice by people always assuming massive scale.
I miss the powerful metrics and logging systems that I used in Amazon.

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. > Spend that producing high quality metrics directly from the apps

Absolutely not. Most application metric systems generate metrics as text strings with a simple format that is parsed by the metric collector.

This is what we also call a structured log. Parsing such text strings takes very little CPU.

All logs and metrics represent events. A good approach is to prefer numerical values where possible, but only for quantities that are comparable. Metrics are for the "how many?" question.

But never forget to log text events, because you need to answer the "what happened?" question.

Don't be afraid of generating too many different metrics but avoid too frequent datapoints and unnecessary verbosity in logs.

Never dump complex objects "just in case". Treat overlogging and underlogging as a bug.

Spend time every day in reviewing the metric dashboards and improve them constantly.

If it takes more that 10 seconds do add a new non-obvious chart (e.g. to calculate a ratio between 2 metrics or a percentile or other computation) throw away your charting system.

Lying with numbers is very easy: always look at distributions, not just instant values. Some metrics must be represented as percentiles and min/avg/max are meaningless.

Percentiles are good for ignoring meaningless outliers, but always count the outliers to ensure that you are not ignoring meaningful data. Especially during incidents.

Metrics and text logs tell a story together. Process, correlate and visualize them together as much as possible.

> Processing/streaming logs to get metrics is a terrible waste of time, energy and money. Spend that producing high quality metrics directly from the apps you are looking after/writing/decomming

The other side is that I don't know what metrics I'll want until later.

When do you think it's better to pull metrics from structured logs vs generating metrics in app?

You can go the https://www.honeycomb.io/ way and make structured logs your metrics. It will cost you a lot in storage, but simplifies a lot. Just throw properly structured logs into storage as long as you query them efficiently (which honeycomb provides)
I think the only times it ever really makes sense to use logs to generate metrics are fairly limited:

1. You haven't yet instrumented the application with metrics yet.

2. The logs are from a third party tool that don't emit metrics

3. The log format is well defined and doesn't change (I'd still prefer native metrics)

Otherwise the issue is that logging messages can and do change over the lifetime of an application. Relying on the content of the log for metrics becomes an implicit API that's not obvious to developers working on the code. I've seen issues of broken monitoring and alerting because a refactor changed log formatting and content. Much better to be explicit about metrics and instrument them directly.

Aha! that is the eternal question.

TL;DR:

almost never. structured logs are expensive in terms of infra, management and query time. Storing logs just in case is much more expensive at any kind of scale compared to metrics alone.

Long answer:

A lot of it depends on what the service/program is meant to be doing.

If we take a proxying webs service router for example listening on example.com/* We would want metrics to tell us how well its doing for its specific job, and any upstream services.

So for each service URL we'd want at least a hit count for 2xx, 3xx, specific 4xx and 5xx return codes. We'd also want the time taken to process that request.

We'd also probably want to know the total number of active connection to back end, and total clients connected. Memory and CPU usage would also be a given.

From that we could easily ascertain the health of upstream services, the performance, and total load (which is useful for autoscaling of either the service router, or the upstream apps)

I think it requires sitting down with a peice of paper and imagining your service/app breaking, and then working back to see how that would look. Once you've done that, you can figure out some counters to keep track of those thins.

On the rest I agree but on

> Raw metrics are pretty useless. They need to be manipulated into buisness goals: service x is producing 3% 5xx errors vs % of visitors unable to perform action x

I think in general the business goals metrics are OK but you still need to keep lower level metrics as well, otherwise it would be more difficult to pinpoint the exact failure, you will just know that a % of visitors is unable to perform action X. In a moderately-complex system a user-level action X is probably composed by several low-level services.

I agree wholeheartedly.

I was trying to get across that just because you collect metrics it doesn't make them useful. I encourage people to generate metrics for everything, we can always join them together later to make something useful.

I think what I should have said is: "Collect metrics for everything, but be sure to display them is a way thats relevant to the customer"

Gavin from Zebrium here. We've found that if only you somehow knew what you were monitoring for in logs, they can be a great source of detecting (and then describing) the long tail of unknown/unknowns (failure modes with unknown symptoms and causes). Our approach is to be able to find these patterns in near real-time using ML. This blog by our CTO explains the tech with some good examples: https://www.zebrium.com/blog/is-autonomous-monitoring-the-an....
> If you are searching through logs to _find_ a problem, its far too late.

To be fair, this is addressed in the article which links to Netflix's blog on the topic and how they do so effectively at their scale: https://netflixtechblog.com/lessons-from-building-observabil...

Agreed for most what was said there. Still, I find that people mostly use SLA as only thing important to track for alerting and incidents arousal. There is a lot of said about importance of defining solid SLI - Service Level Indicators which are aligned to SLO - Service Level Objectives SLAs are usually given to external user of SaaS, not very useful for SRE team.
> Processing/streaming logs to get metrics is a terrible waste of time, energy and money.

Only if you're using Elastic.

I was going to try out Elastic APM for self-hosted APM option. Would that be the same case of a waste of time, energy, money? TIA for any insights!
The difference in metrics seems to be proportional to the level of understanding about how the organization works.
sometimes log volume would be too high (high TX count)

although in most of the systems in my career it has not been the case.