

	by Osmose 841 days ago
	This isn't an unknown idea outside of Meta, it's just really expensive, especially if you're using a vendor and not building your own tooling. Prohibitively so, even with sampling.


ricardobeat 841 days ago

Exactly. When they say

> Unlike with prometheus, however, with Wide Events approach we don’t need to worry about cardinality

This is hinting at the hidden reason why not everyone does it. You have to 'worry' about cardinality because Prometheus is pre-aggregating data so you can visualize it fast, and optimizing storage. If you want the same speed on a massive PB-scale data lake, with an infinite amount of unstructured data, and in the cloud instead of your own datacenters, it's gonna cost you a lot, and for most companies it is not a sensible expense.
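To make the cardinality worry concrete: in a pre-aggregating system, each distinct combination of label values becomes its own time series, so the series count is bounded by the product of the per-label cardinalities. A rough sketch, where the label names and counts are made-up assumptions for illustration, not figures from anyone's real setup:

```python
from math import prod

# Hypothetical label cardinalities for a single metric (illustrative only).
label_cardinality = {
    "endpoint": 50,
    "status_code": 8,
    "region": 6,
}

def max_series(cardinalities: dict[str, int]) -> int:
    """Upper bound on time series: one per distinct label-value combination."""
    return prod(cardinalities.values())

baseline = max_series(label_cardinality)  # 50 * 8 * 6 = 2400

# Adding a single high-cardinality label (say, a user ID) multiplies the bound.
with_user_id = max_series({**label_cardinality, "user_id": 100_000})

print(baseline)      # 2400
print(with_user_id)  # 240000000
```

With raw events, `user_id` is just another field on each row rather than a multiplier on pre-computed series, which is why the cardinality limit goes away and the cost shows up in storage and query time instead.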

It does work at smaller scale though, we once had an in-house system like this that worked well. Eventually user events were moved to MixPanel, and everything else to Datadog, metrics/logs/traces + a migration to OpenTel. It took months and added 2-digit monthly bills, and in the end debugging or resolving incidents wasn't much improved over having instant access to events and business metrics. Whoever figures out a system that can do "wide events" in a cost-effective way from startup to unicorn scale will absolutely make a killing.

AeroNotix 841 days ago

https://victoriametrics.com/ would definitely recommend anyone having performance issues with Prometheus to give VictoriaMetrics a try.

jonasdegendt 841 days ago

Once you plug long term storage onto your Prometheus, do you really need the main Prometheus instances anymore?

Here’s an article about this idea: https://datadrivendrivel.com/posts/rmrfprometheus/

You can substitute the Grafana Agents for OTEL collectors as well.

valyala 839 days ago

While Grafana Agent uses fewer resources than Prometheus, there is a more optimized Prometheus-compatible scraper and router: vmagent [1]. I'd recommend giving Grafana Agent and vmagent the same workload and comparing their resource usage.

P.S. Prometheus itself can also act as a lightweight agent, which collects metrics and forwards them to the configured remote storage [2].

[1] https://docs.victoriametrics.com/vmagent/

[2] https://prometheus.io/blog/2021/11/16/agent/

AeroNotix 841 days ago

That's just kicking the can over to an object storage API instead of managing disks.

camel_gopher 841 days ago

And comes with all the downsides of Prometheus as well

dengolius 840 days ago

Could you please elaborate more on "comes with all the downsides of Prometheus"?

uaas 841 days ago

Or, you can use Thanos, the de-facto standard with the biggest OSS community.

dengolius 840 days ago

The Thanos/Mimir community doesn't help with the configuration routine, or with the even bigger resource consumption of a huge setup.

uaas 840 days ago

Bigger resource consumption of what exactly? Leaf Prometheus instances or the Thanos/Mimir stack compared to VictoriaMetrics? Have you seen a large scale migration between the two, with actual numbers?

valyala 839 days ago

A few interesting real-world large-scale migrations are highlighted at https://docs.victoriametrics.com/casestudies/

hosh 841 days ago

Not everything emits wide events. Maybe you can get the entire application layer like that, but there is also value in logs and metrics emitted from the rest of the infra stack.

To be fair, you could probably store and represent everything as wide events and build visualization tools out of that that can combine everything together, even if they are sourced from something else.

lelandbatey 841 days ago

Wide events seem to be "structured logs with focused schemas" (maybe also published in a special way beyond writing to stdout) but most places I've worked would call that "logging" not "wide events".

The reasons we don't use them for everything are as others in the thread say: it's expensive. Metrics (just the numbers, nothing else) can be compressed and aggregated extremely efficiently, hence cheaply. Logs are more expensive due to their arbitrary contents.
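One common trick behind that efficiency is delta encoding: regularly scraped numeric series change in small, predictable steps, so storing differences (and differences of differences) yields mostly tiny numbers that pack into very few bits. A minimal sketch of the idea, with made-up sample data; real time series databases use more elaborate bit-level schemes than this:

```python
def deltas(values: list[int]) -> list[int]:
    """Keep the first value as-is, then store differences between neighbours."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]

# Hypothetical scrape timestamps, 15 seconds apart.
timestamps = [1_700_000_000 + 15 * i for i in range(6)]
first_order = deltas(timestamps)        # [1700000000, 15, 15, 15, 15, 15]
second_order = deltas(first_order[1:])  # [15, 0, 0, 0, 0]

# Hypothetical counter samples: monotonically increasing, small steps.
counter = [100, 103, 103, 110, 118, 118]
counter_deltas = deltas(counter)        # [100, 3, 0, 7, 8, 0]

print(second_order)
print(counter_deltas)
```

Arbitrary log text has no such regularity, which is one way to see why it compresses less predictably.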

It's all due to expense really.

isburmistrov 841 days ago

Columnar storage stores data very efficiently, too - because it compresses data of a similar nature (columns). Check e.g. ClickHouse on this matter: https://clickhouse.com/docs/en/about-us/distinctive-features, https://clickhouse.com/blog/working-with-time-series-data-an...

So I wouldn't say that events are "expensive" while metrics are "cheap" - both depend on the actual implementation, and events can be cheap too.

And so of course if you have to optimise things, you would need to drop some information you pass to the events, but you would need to do the same for metrics (reduce the number of metrics emitted, reduce the prometheus labels,...).

adql 841 days ago

Only if you have small, pre-defined sets of events in data structures that compress well. That is not the case for any real system.

> And so of course if you have to optimise things, you would need to drop some information you pass to the events, but you would need to do the same for metrics (reduce the number of metrics emitted, reduce the prometheus labels,...).

Those are entirely different orders of magnitude both when it comes to size and how much usefulness you lose. In modern storage backends like VictoriaMetrics, a counter is going to cost you around a byte per metric per probe. And as you emit them periodically, that is essentially independent of incoming traffic.

Capturing the requests into event/trace/whatever other name they gave to logs this month is many times that and is multiplied by traffic.
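As a back-of-the-envelope version of this argument: the roughly one byte per sample comes from the claim above, while the series count, scrape interval, request rates, and 200 bytes per stored event are made-up assumptions chosen only to show the shape of the scaling.

```python
SECONDS_PER_DAY = 86_400

def metrics_bytes_per_day(series: int, scrape_interval_s: int,
                          bytes_per_sample: int = 1) -> int:
    """Storage grows with series count and scrape frequency, not with traffic."""
    samples_per_series = SECONDS_PER_DAY // scrape_interval_s
    return series * samples_per_series * bytes_per_sample

def events_bytes_per_day(requests_per_s: int, bytes_per_event: int) -> int:
    """Storage grows linearly with traffic: one stored event per request."""
    return requests_per_s * SECONDS_PER_DAY * bytes_per_event

# Assumed: 1,000 series scraped every 15 s; 200 bytes per stored event.
metrics = metrics_bytes_per_day(series=1_000, scrape_interval_s=15)
for rps in (100, 1_000):
    events = events_bytes_per_day(rps, bytes_per_event=200)
    print(f"{rps} req/s: metrics {metrics:,} B/day, events {events:,} B/day")
```

Under these assumptions the metrics side stays at 5,760,000 bytes per day regardless of load, while the events side is 1,728,000,000 bytes per day at 100 req/s and ten times that at 1,000 req/s.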

isburmistrov 841 days ago

> Those are entirely different orders of magnitude both when it comes to size and how much usefulness you lose. In modern storage backends like VictoriaMetrics, a counter is going to cost you around a byte per metric per probe. And as you emit them periodically, that is essentially independent of incoming traffic.

I thought this argument was about whether wide events can be used for metrics, or whether metrics are a completely different concept. If we want to emulate metrics with events, we would also emit them periodically, independently of the traffic, pretty much like Prometheus scraping works.

hagen1778 840 days ago

Storing telemetry efficiently is only part of what monitoring is supposed to do. The other part is querying: ad-hoc queries, dashboards, alerting queries executed every 15s or so. For querying to work fast, there has to be an efficient index, or multiple indexes depending on the query. Since you referred to ClickHouse as efficient columnar storage, please see what makes it different from a time series database - https://altinity.com/wp-content/uploads/2021/11/How-ClickHou...

isburmistrov 840 days ago

And yet people use ClickHouse quite effectively for this very problem, see the comment here: https://news.ycombinator.com/item?id=39549218

There are also time-series databases out there that are OK with high cardinality: https://questdb.io/blog/2021/06/16/high-cardinality-time-ser...

flaminHotSpeedo 841 days ago

The whole point of wide events is recording an arbitrary set of key value pairs. How do you propose storing that in a columnar datastore?

phillipcarter 841 days ago

I can't speak for others, but at Honeycomb that's what we do. There's some details in this blog post that might be interesting: https://www.honeycomb.io/blog/why-observability-requires-dis...
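For readers wondering what "arbitrary key-value pairs in a columnar store" can look like in principle, here is a toy sketch: one column per key, created lazily the first time the key appears, holding values only for the rows that have it. This is an illustrative assumption, not a description of Honeycomb's actual storage engine, and the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Any

class SparseColumnStore:
    """Toy columnar store: each key gets its own sparse column of row_id -> value."""

    def __init__(self) -> None:
        self.columns: dict[str, dict[int, Any]] = defaultdict(dict)
        self.row_count = 0

    def append(self, event: dict[str, Any]) -> int:
        """Store one wide event; unseen keys create new columns on the fly."""
        row_id = self.row_count
        for key, value in event.items():
            self.columns[key][row_id] = value
        self.row_count += 1
        return row_id

    def scan(self, key: str) -> list[tuple[int, Any]]:
        """Read a single column without touching any other field."""
        return sorted(self.columns.get(key, {}).items())

store = SparseColumnStore()
store.append({"user_id": 1, "latency_ms": 12})
store.append({"user_id": 2, "feature_flag": "beta"})  # new key, new column

print(store.scan("latency_ms"))    # [(0, 12)]
print(store.scan("feature_flag"))  # [(1, 'beta')]
```

Rows that lack a key simply have no entry in that column, so a query over `latency_ms` never reads `feature_flag` data.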

GauntletWizard 841 days ago

I worked on Scuba, inside and outside of Meta (Interana), and yeah - It was expensive AF. I recommend focusing on metrics first. Use analytics logging sparingly, and understand the statistics of how metrics work, because without understanding those statistics you'll misread your events anyway.

This is not to say that wide events aren't worth it - For many things, something like Scuba or BigQuery is invaluable. There are ways to optimize. But we're talking about "One of AWS's largest machines" vs "A couple cores", and I suggest learning Prometheus first.

Xcelerate 841 days ago

> understand the statistics of how metrics work

Haha, since you worked on Scuba I’ll mention IMO this point was by far the biggest flaw of ODS. No one ever performed the metric rollups correctly. Average of averages? And at what granularity? ODS downsampled the older time series data but now perhaps you’re taking a percentile over a “max of maxes”. Except it only sometimes used that method of downsampling automatically.
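A tiny worked example of the "average of averages" trap, using made-up latencies for two hypothetical hosts that receive very different amounts of traffic:

```python
from statistics import mean

# Hypothetical per-request latencies in ms (illustrative only).
host_a = [10] * 8   # 8 fast requests
host_b = [100] * 2  # 2 slow requests

# Naive rollup: average the per-host averages, ignoring request counts.
avg_of_avgs = mean([mean(host_a), mean(host_b)])  # (10 + 100) / 2 = 55

# Correct rollup: average over all requests.
true_avg = mean(host_a + host_b)                  # (80 + 200) / 10 = 28

# The same result from pre-aggregated data, if counts are kept with the means.
weighted_avg = (
    mean(host_a) * len(host_a) + mean(host_b) * len(host_b)
) / (len(host_a) + len(host_b))

print(avg_of_avgs, true_avg, weighted_avg)
```

The naive rollup reports 55 ms while the true mean is 28 ms; the rollup is only recoverable if the sample counts are stored next to the averages.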

And I seem to recall the labels “daily”, “weekly”, and “monthly” not being intuitive either, and two of them meant the same thing... that was quite a mess to work with.

A lot of the autoscaling systems were wonky because the ODS metrics they were based upon didn’t represent what people thought they did.

isburmistrov 841 days ago

Never in the world would I have expected my post to spark a discussion about ODS flaws :D

mlhpdx 841 days ago

I don’t know that’s true. My last two very-not-meta-sized companies have both had systems that were very cost effective and essentially what the article describes. It’s not the simplest thing to put in place, but far from unapproachable.

I think one of the big hills is moving to a culture that values observability (or whatever you choose to call it, I prefer forensic debugging). It’s another thing to understand and worry about and it helps tremendously if there are good, highly visible examples of it.

Edit: Typo.

gtirloni 841 days ago

Could you share some specifics of how it could be approached?

hosh 841 days ago

I don't know what that commenter has in mind. My own experience building this up is to start with usable information and not try to instrument everything at once. Those are usually:

- some way to get to errors when they happen

- zeroing in on the key performance indicators for your application, and relating them to infra metrics, particularly resources (because cpu, mem, storage, and bandwidth cost money).

Unless you have both domain and infra knowledge, it will be hard to know ahead of time.

For a stateless web app backed by a db, you're typically starting with:

- request metrics (req/s, latency)

- authenticated user activity

- db metrics (such as what you'd get with pganalyze)
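As a rough illustration of the first item, request rate and latency percentiles can be derived from nothing more than a list of (timestamp, latency) records. The data below is made up, and the nearest-rank percentile is just one simple definition among several:

```python
import math

def nearest_rank_percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request log: (unix_timestamp_s, latency_ms).
requests = [
    (0, 12), (0, 15), (1, 11), (1, 250), (2, 14),
    (3, 13), (3, 16), (4, 12), (4, 18), (4, 90),
]

timestamps = [ts for ts, _ in requests]
latencies = [ms for _, ms in requests]

window_s = max(timestamps) - min(timestamps) + 1  # 5 one-second buckets
req_per_s = len(requests) / window_s              # 10 / 5 = 2.0
p50 = nearest_rank_percentile(latencies, 50)      # 14
p95 = nearest_rank_percentile(latencies, 95)      # 250

print(req_per_s, p50, p95)
```

Note how the median looks healthy while the p95 exposes the slow outlier, which is why latency is usually tracked as percentiles rather than a single average.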

It's when there is resource pressure that things get interesting. Here, you have product-fit, you have user traction and growth, but now your app is falling down because it is popular.

It is tempting to just crank things up horizontally and say, you're trying to land-grab users ... but your team will never develop the discipline to develop scalable and reliable software. It's here that you start adding instrumentation to find bottlenecks -- whether that is instrumenting spans, adding metrics, optimizing queries, etc. You also need to craft the dashboard to give actionable intelligence. Here's where Datadog's notebook feature is great -- you explore (and collaborate) with the notebook until you can find the bottleneck, and then export the useful metrics into a dashboard. Then you set up the monitoring, because you have found the key performance indicators.

It's this active search to understand what is going on in _both_ app and infra that shows you the limits of the current architectural designs, guides what you need to do, and validates the architectural and engineering decisions for the future. This active search may involve tools beyond OpenTelemetry or Datadog or Honeycomb -- maybe you have to attach a REPL, or go poking around a memory profiler.

What you _don't_ do is blindly add these things because having the capability somehow makes things better. Rather, you incrementally improve your capability in order to solve your present scalability and reliability problems with your app and its infra.

staticautomatic 840 days ago

Maybe this is a dumb question but why wouldn’t it be cost effective to pre-aggregate counts occasionally and sample on the fly?