| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by btown 331 days ago

The question is whether you want to do your aggregation by unit time at the application level, or at an observability layer. You're absolutely right that the end user of metrics wants to see things grouped by time - but what if they want to filter down to "events where attribute X had value Y, in 10 second increments" but you had decided to group your metrics by 15 second increments without regard to attribute X?

Various companies, both in-house for big tech and then making this more widely accessible, started to answer this question by saying "pump all your individual logs in structured form into a giant columnar database that can handle nearly arbitrary numbers of columns, and we'll handle letting you slice and dice metrics out of any combination of columns you want. And if you have an ID follow the session around between different microservices, and maybe even all the way to the browser session, you can track the entire distributed system."

Different people might say that Datadog, Honeycomb, or Clickhouse (and the various startups backed by Clickhouse as a database) were the ones to make this pattern mainstream, and all of them pushed the boundaries in one way or another - nowadays, there's a whole https://opentelemetry.io/ standard, and if you emit according to that, you can plug in various sinks made by various startups, and choose the metrics UX that makes the most sense for your use case.

I'm a huge fan of Honeycomb - when I know a certain issue is happening, I can immediately see a chart showing latencies and frequencies, and click any hot spot to filter out the individual traces that exhibit the behavior and trace the end-to-end user journey, with all the different logs from all the systems touched by that request. And I can even begin this discovery from a single bug report by a single user whose ID I know. It's not just metrics - it's operational support. And if I'd pre-aggregated logs, I'd have none of this.

But of course, there are systems where this doesn't make sense! Large batch jobs, high-performance systems with orders of magnitudes more events than a standard web application... it's not one size fits all. That said, I think knowing about modern observability should be part of every developer's toolkit.