by Osmose 841 days ago
This isn't an unknown idea outside of Meta, it's just really expensive, especially if you're using a vendor and not building your own tooling. Prohibitively so, even with sampling.
4 comments

Exactly. When they say

> Unlike with prometheus, however, with Wide Events approach we don’t need to worry about cardinality

This is hinting at the hidden reason why not everyone does it. You have to 'worry' about cardinality because Prometheus is pre-aggregating data so you can visualize it fast, and optimizing storage. If you want the same speed on a massive PB-scale data lake, with an infinite amount of unstructured data, and in the cloud instead of your own datacenters, it's gonna cost you a lot, and for most companies it is not a sensible expense.
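To make the cardinality worry concrete, here is a rough sketch of why label choice matters for a pre-aggregating system. The label names and their distinct-value counts are made-up numbers for illustration, and the product is only an upper bound, since not every combination shows up in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on the time series one metric can fan out into:
    the product of the number of distinct values of each label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request counter.
low = {"method": 5, "status": 6, "endpoint": 50}
high = {**low, "user_id": 100_000}  # one unbounded label added

print(series_count(low))   # 1500
print(series_count(high))  # 150000000
```

Every one of those series has to be tracked and indexed separately, which is the part a wide-events store avoids by not pre-aggregating, at the price of scanning raw data at query time.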

It does work at smaller scale though, we once had an in-house system like this that worked well. Eventually user events were moved to MixPanel, and everything else to Datadog, metrics/logs/traces + a migration to OpenTel. It took months and added 2-digit monthly bills, and in the end debugging or resolving incidents wasn't much improved over having instant access to events and business metrics. Whoever figures out a system that can do "wide events" in a cost-effective way from startup to unicorn scale will absolutely make a killing.

https://victoriametrics.com/ would definitely recommend anyone having performance issues with Prometheus to give VictoriaMetrics a try.
Once you plug long term storage onto your Prometheus, do you really need the main Prometheus instances anymore?

Here’s an article about this idea: https://datadrivendrivel.com/posts/rmrfprometheus/

You can substitute the Grafana Agents for OTEL collectors as well.

While Grafana Agent uses fewer resources than Prometheus, there is a more optimized Prometheus-compatible scraper and router: vmagent [1]. I'd recommend giving Grafana Agent and vmagent the same workload and comparing their resource usage.

P.S. Prometheus itself can also act as a lightweight agent, which collects metrics and forwards them to the configured remote storage [2].

[1] https://docs.victoriametrics.com/vmagent/

[2] https://prometheus.io/blog/2021/11/16/agent/

That's just kicking the can over to an object storage API instead of managing disks.
And comes with all the downsides of Prometheus as well
Could you please elaborate on "comes with all the downsides of Prometheus"?
Or, you can use Thanos, the de-facto standard with the biggest OSS community.
The Thanos/Mimir community doesn't help with the configuration burden, or with the even bigger resource consumption of a huge setup.
Bigger resource consumption of what exactly? Leaf Prometheus instances or the Thanos/Mimir stack compared to VictoriaMetrics? Have you seen a large scale migration between the two, with actual numbers?
A few interesting real-world large-scale migrations are highlighted at https://docs.victoriametrics.com/casestudies/
Not everything emits wide events. Maybe you can get the entire application layer like that, but there is also value in logs and metrics emitted from the rest of the infra stack.

To be fair, you could probably store and represent everything as wide events and build visualization tools out of that that can combine everything together, even if they are sourced from something else.

Wide events seem to be "structured logs with focused schemas" (maybe also published in a special way beyond writing to stdout) but most places I've worked would call that "logging" not "wide events".

The reasons we don't use them for everything are as others in the thread say: it's expensive. Metrics (just the numbers, nothing else) can be compressed and aggregated extremely efficiently, hence cheaply. Logs are more expensive due to their arbitrary contents.

It's all due to expense really.

Columnar storage stores data very efficiently, too - because it compresses data of a similar nature (columns). Check e.g. ClickHouse on this matter: https://clickhouse.com/docs/en/about-us/distinctive-features, https://clickhouse.com/blog/working-with-time-series-data-an...

So I wouldn't say that events are "expensive" while metrics are "cheap" - both depend on the actual implementation, and events can be cheap too.

And so of course if you have to optimise things, you would need to drop some information you pass to the events, but you would need to do the same for metrics (reduce the number of metrics emitted, reduce the prometheus labels,...).

Only if you have small, pre-defined sets of events in data structures that compress well. That is not the case for any real system.

> And so of course if you have to optimise things, you would need to drop some information you pass to the events, but you would need to do the same for metrics (reduce the number of metrics emitted, reduce the prometheus labels,...).

Those are entirely different orders of magnitude, both when it comes to size and how much usefulness you lose. In modern storage backends like VictoriaMetrics, a counter is going to cost you around a byte per metric per probe. And as you emit them periodically, that cost is essentially independent of incoming traffic.

Capturing the requests into event/trace/whatever other name they gave to logs this month is many times that and is multiplied by traffic.
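A back-of-the-envelope version of that comparison, as a sketch only. The "about one byte per sample" figure is the one claimed above; the series count, the 15 s scrape interval, the request rates, and the 500-byte event size are all assumptions picked for illustration.

```python
SECONDS_PER_DAY = 86_400


def metric_bytes_per_day(series: int, scrape_interval_s: int,
                         bytes_per_sample: float = 1.0) -> float:
    """Metrics cost scales with series count and scrape frequency, not traffic."""
    return series * (SECONDS_PER_DAY / scrape_interval_s) * bytes_per_sample


def event_bytes_per_day(requests_per_s: float, bytes_per_event: float) -> float:
    """Per-request events scale linearly with traffic."""
    return requests_per_s * SECONDS_PER_DAY * bytes_per_event


metrics = metric_bytes_per_day(series=1_000, scrape_interval_s=15)
print(f"metrics: {metrics:,.0f} bytes/day")  # same at any request rate

for rps in (10, 1_000):  # assumed traffic levels
    events = event_bytes_per_day(rps, bytes_per_event=500)  # assumed event size
    print(f"events at {rps} req/s: {events:,.0f} bytes/day")
```

Compression on the event side will shrink the second number, but it stays proportional to traffic while the first one does not.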

> Those are entirely different orders of magnitude, both when it comes to size and how much usefulness you lose. In modern storage backends like VictoriaMetrics, a counter is going to cost you around a byte per metric per probe. And as you emit them periodically, that cost is essentially independent of incoming traffic.

I thought this argument was about whether wide events can be used for metrics, or whether metrics are a completely different concept. If we want to emulate metrics with events, we would also emit them periodically, independent of the traffic, pretty much like Prometheus scraping works.

Storing telemetry efficiently is only part of what monitoring is supposed to do. The other part is querying: ad-hoc queries, dashboards, alerting queries executed every 15s or so. For querying to be fast, there has to be an efficient index, or multiple indexes depending on the query. Since you referred to ClickHouse as efficient columnar storage, please see what makes it different from a time series database - https://altinity.com/wp-content/uploads/2021/11/How-ClickHou...
And yet people use ClickHouse quite effectively for this very problem, see the comment here: https://news.ycombinator.com/item?id=39549218

There are also time-series databases out there that are OK with high cardinality: https://questdb.io/blog/2021/06/16/high-cardinality-time-ser...

The whole point of wide events is recording an arbitrary set of key value pairs. How do you propose storing that in a columnar datastore?
I can't speak for others, but at Honeycomb that's what we do. There's some details in this blog post that might be interesting: https://www.honeycomb.io/blog/why-observability-requires-dis...
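For readers wondering how arbitrary key-value pairs can fit a columnar layout at all, one generic approach is a sparse column per key, created the first time the key is seen. The toy sketch below only illustrates that idea; it is not a description of Honeycomb's actual storage engine, and `SparseColumnStore` is a made-up name.

```python
from collections import defaultdict


class SparseColumnStore:
    """Toy in-memory store: one sparse column per key, created on first use."""

    def __init__(self) -> None:
        # key -> {row_id: value}; rows lacking a key take no space in its column
        self.columns: dict[str, dict[int, object]] = defaultdict(dict)
        self.row_count = 0

    def append(self, event: dict) -> int:
        """Store one wide event and return its row id."""
        row_id = self.row_count
        for key, value in event.items():
            self.columns[key][row_id] = value
        self.row_count += 1
        return row_id

    def scan(self, key: str) -> dict[int, object]:
        """Read a single column without touching any other key."""
        return self.columns.get(key, {})

    def row(self, row_id: int) -> dict:
        """Reassemble one event from the columns that contain it."""
        return {k: col[row_id] for k, col in self.columns.items() if row_id in col}


store = SparseColumnStore()
store.append({"status": 200, "route": "/home"})
store.append({"route": "/login", "user_id": "u1", "retries": 2})
store.append({"status": 500, "error": "timeout"})

print(store.scan("status"))  # {0: 200, 2: 500}
print(store.row(1))
```

A query that filters on `status` reads only that column, which is where the columnar efficiency comes from even though no schema was declared up front.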
I worked on Scuba, inside and outside of Meta (Interana), and yeah - It was expensive AF. I recommend focusing on metrics first. Use analytics logging sparingly, and understand the statistics of how metrics work, because without understanding those statistics you'll misread your events anyway.

This is not to say that wide events aren't worth it - for many things, something like Scuba or BigQuery is invaluable. There are ways to optimize. But we're talking about "One of AWS's largest machines" vs "A couple cores", and I suggest learning Prometheus first.

> understand the statistics of how metrics work

Haha, since you worked on Scuba I’ll mention IMO this point was by far the biggest flaw of ODS. No one ever performed the metric rollups correctly. Average of averages? And at what granularity? ODS downsampled the older time series data but now perhaps you’re taking a percentile over a “max of maxes”. Except it only sometimes used that method of downsampling automatically.

And I seem to recall the labels “daily”, “weekly”, and “monthly” not being intuitive either, and two of them meant the same thing... that was quite a mess to work with.

A lot of the autoscaling systems were wonky because the ODS metrics they were based upon didn’t represent what people thought they did.
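The "average of averages" problem mentioned above is easy to reproduce. This is a minimal sketch with invented latency numbers and host names, not anything specific to ODS.

```python
# Latency samples in ms per host; host-b handled far fewer requests.
hosts = {
    "host-a": [100, 100, 100, 100],
    "host-b": [1000],
}

# Rollup done wrong: average each host, then average the averages.
per_host_avg = {h: sum(v) / len(v) for h, v in hosts.items()}
avg_of_avgs = sum(per_host_avg.values()) / len(per_host_avg)

# Ground truth: average over every individual sample.
all_samples = [x for v in hosts.values() for x in v]
true_avg = sum(all_samples) / len(all_samples)

# Rollup done right: keep (sum, count) per host and combine those.
total = sum(sum(v) for v in hosts.values())
count = sum(len(v) for v in hosts.values())
weighted_avg = total / count

print(avg_of_avgs)   # 550.0
print(true_avg)      # 280.0
print(weighted_avg)  # 280.0
```

Sums and counts can be merged across hosts and time buckets without distortion. Percentiles are harder: they cannot be recovered from pre-computed per-bucket maxima or averages at all.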

Never in the world would I have expected my post to start a discussion about ODS flaws :D
I don’t know that’s true. My last two very-not-meta-sized companies have both had systems that were very cost effective and essentially what the article describes. It’s not the simplest thing to put in place, but far from unapproachable.

I think one of the big hills is moving to a culture that values observability (or whatever you choose to call it, I prefer forensic debugging). It’s another thing to understand and worry about and it helps tremendously if there are good, highly visible examples of it.

Edit: Typo.

Could you share some specifics of how it could be approached?
I don't know what that commenter has in mind. My own experience building this up is to start with usable information and not try to instrument everything at once. That usually means:

- some way to get to errors when they happen

- zeroing in on the key performance indicators for your application, and relating them to infra metrics, particularly resources (because cpu, mem, storage, and bandwidth cost money).

Unless you have both domain and infra knowledge, it will be hard to know ahead of time.

For a stateless web app backed by a db, you're typically starting with:

- request metrics (req/s, latency)

- authenticated user activity

- db metrics (such as what you'd get with pganalyze)

It's when there is resource pressure that things get interesting. Here, you have product-fit, you have user traction and growth, but now your app is falling down because it is popular.

It is tempting to just crank things up horizontally and say, you're trying to land-grab users ... but your team will never develop the discipline to develop scalable and reliable software. It's here that you start adding instrumentation to find bottlenecks -- whether that is instrumenting spans, adding metrics, optimizing queries, etc. You also need to craft the dashboard to give actionable intelligence. Here's where Datadog's notebook feature is great -- you explore (and collaborate) with the notebook until you can find the bottleneck, and then export the useful metrics into a dashboard. Then you set up the monitoring, because you have found the key performance indicators.

It's this active search to understand what is going on in _both_ app and infra that shows you the limits of the current architectural designs, guides what you need to do, and validates the architectural and engineering decisions for the future. This active search may involve tools beyond OpenTelemetry or Datadog or Honeycomb -- maybe you have to attach a REPL, or go poking around a memory profiler.

What you _don't_ do is blindly add these things because having the capability somehow makes things better. Rather, you incrementally improve your capability in order to solve your present scalability and reliability problems with your app and its infra.

Maybe this is a dumb question but why wouldn’t it be cost effective to pre-aggregate counts occasionally and sample on the fly?
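For reference, what the question proposes might look roughly like the sketch below: exact counters updated for every event, plus a random sample of full events that each carry a weight so totals can be estimated later. The `ingest` function, the event fields, and the 10% rate are all assumptions for illustration; it does not settle whether the approach is cost effective at scale.

```python
import random
from collections import Counter


def ingest(events, sample_rate: float, rng: random.Random):
    """Count every event exactly, but keep full detail for only a random sample."""
    counts = Counter()
    sampled = []
    for event in events:
        counts[event["status"]] += 1          # cheap pre-aggregation
        if rng.random() < sample_rate:        # on-the-fly sampling
            # Each kept event stands in for 1 / sample_rate original events.
            sampled.append({**event, "weight": 1 / sample_rate})
    return counts, sampled


rng = random.Random(42)  # seeded so runs are repeatable
events = [
    {"status": 500 if i % 100 == 0 else 200, "user": f"u{i % 37}"}
    for i in range(10_000)
]

counts, sampled = ingest(events, sample_rate=0.1, rng=rng)
estimated_total = sum(e["weight"] for e in sampled)

print(counts[200], counts[500])            # exact: 9900 100
print(len(sampled), round(estimated_total))  # approximate, varies with the seed
```

The counters answer "how many" exactly, while the weighted sample answers "which users, which routes" approximately. The trade-off is that rare events can be missed entirely by the sample, and any dimension not pre-aggregated is only as accurate as the sample rate allows.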