by ricardobeat 844 days ago

Exactly. When they say

> Unlike with prometheus, however, with Wide Events approach we don’t need to worry about cardinality

This is hinting at the hidden reason why not everyone does it. You have to 'worry' about cardinality because Prometheus is pre-aggregating data so you can visualize it fast, and optimizing storage. If you want the same speed on a massive PB-scale data lake, with an infinite amount of unstructured data, and in the cloud instead of your own datacenters, it's gonna cost you a lot, and for most companies it is not a sensible expense.
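A rough sketch of why cardinality is a worry in a pre-aggregating store: every distinct combination of label values becomes its own time series, so the worst-case series count for one metric is the product of the per-label cardinalities. The label names and counts below are made up for illustration.

```python
from math import prod


def series_count(label_cardinalities: dict) -> int:
    """Worst-case number of time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-counter metric.
low = {"method": 5, "status": 10, "endpoint": 50}
high = {**low, "user_id": 100_000}  # one extra, effectively unbounded label

print(series_count(low))   # 2500
print(series_count(high))  # 250000000
```

One unbounded label multiplies everything else, which is why it has to be dropped or bucketed before the data is aggregated.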

It does work at a smaller scale though; we once had an in-house system like this that worked well. Eventually user events were moved to MixPanel, and everything else (metrics, logs, traces) to Datadog, plus a migration to OpenTelemetry. It took months and added 2-digit monthly bills, and in the end debugging or resolving incidents wasn't much improved over having instant access to events and business metrics. Whoever figures out a system that can do "wide events" in a cost-effective way from startup to unicorn scale will absolutely make a killing.


AeroNotix 844 days ago

I would definitely recommend that anyone having performance issues with Prometheus give VictoriaMetrics a try: https://victoriametrics.com/

jonasdegendt 844 days ago

Once you plug long term storage onto your Prometheus, do you really need the main Prometheus instances anymore?

Here’s an article about this idea: https://datadrivendrivel.com/posts/rmrfprometheus/

You can substitute the Grafana Agents for OTEL collectors as well.

valyala 842 days ago

While Grafana Agent uses fewer resources than Prometheus, there is a more optimized Prometheus-compatible scraper and router: vmagent [1]. I'd recommend giving Grafana Agent and vmagent the same workload and comparing their resource usage.

P.S. Prometheus itself can also act as a lightweight agent, which collects metrics and forwards them to the configured remote storage [2].

[1] https://docs.victoriametrics.com/vmagent/

[2] https://prometheus.io/blog/2021/11/16/agent/

AeroNotix 844 days ago

That's just kicking the can over to an object storage API instead of managing disks.

camel_gopher 844 days ago

And comes with all the downsides of Prometheus as well

dengolius 843 days ago

Could you please elaborate on "comes with all the downsides of Prometheus"?

uaas 844 days ago

Or, you can use Thanos, the de-facto standard with the biggest OSS community.

dengolius 843 days ago

The Thanos/Mimir community doesn't help with the configuration routine, or with the even bigger resource consumption of a huge setup.

uaas 843 days ago

Bigger resource consumption of what exactly? Leaf Prometheus instances or the Thanos/Mimir stack compared to VictoriaMetrics? Have you seen a large scale migration between the two, with actual numbers?

valyala 842 days ago

A few interesting real-world large-scale migrations are highlighted at https://docs.victoriametrics.com/casestudies/

hosh 844 days ago

Not everything emits wide events. Maybe you can get the entire application layer like that, but there is also value in logs and metrics emitted from the rest of the infra stack.

To be fair, you could probably store and represent everything as wide events and build visualization tools out of that that can combine everything together, even if they are sourced from something else.

lelandbatey 844 days ago

Wide events seem to be "structured logs with focused schemas" (maybe also published in a special way beyond writing to stdout) but most places I've worked would call that "logging" not "wide events".

The reasons we don't use them for everything are as others in the thread say: it's expensive. Metrics (just the numbers, nothing else) can be compressed and aggregated extremely efficiently, hence cheaply. Logs are more expensive due to their arbitrary contents.

It's all due to expense really.

isburmistrov 844 days ago

Columnar storage stores data very efficiently, too - because it compresses data of a similar nature (columns). Check e.g. ClickHouse on this matter: https://clickhouse.com/docs/en/about-us/distinctive-features, https://clickhouse.com/blog/working-with-time-series-data-an...
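A toy illustration of the "similar nature" point, using run-length encoding as a stand-in for whatever codecs a real columnar engine applies (this is not a description of ClickHouse's internals). The rows and field names are invented.

```python
from itertools import groupby


def rle(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Hypothetical events with two low-cardinality fields.
rows = [{"status": s, "region": "eu"}
        for s in (200, 200, 200, 200, 500, 500, 200, 200)]

# Row-oriented layout: fields of each event are interleaved.
row_layout = [value for row in rows for value in (row["status"], row["region"])]

# Column-oriented layout: all values of one field are stored together.
status_column = [row["status"] for row in rows]
region_column = [row["region"] for row in rows]

print(len(rle(row_layout)))  # 16 runs, nothing collapses
print(rle(status_column))    # [(200, 4), (500, 2), (200, 2)]
print(rle(region_column))    # [('eu', 8)]
```

The same 16 values collapse to 4 runs once they are grouped by column, because neighbours in a column tend to look alike.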

So I wouldn't say that events are "expensive" while metrics are "cheap" - both depend on the actual implementation, and events can be cheap too.

And so of course if you have to optimise things, you would need to drop some information you pass to the events, but you would need to do the same for metrics (reduce the number of metrics emitted, reduce the prometheus labels,...).

adql 844 days ago

Only if you have small pre-defined sets of events in data structures that compress well. That is not the case for any real system.

> And so of course if you have to optimise things, you would need to drop some information you pass to the events, but you would need to do the same for metrics (reduce the number of metrics emitted, reduce the prometheus labels,...).

Those are entirely different orders of magnitude both when it comes to size and how much usefulness you lose. In modern storage backends like VictoriaMetrics a counter is gonna cost you around a byte per metric per probe. And as you emit them periodically, that is essentially independent of incoming traffic.

Capturing the requests into event/trace/whatever other name they gave to logs this month is many times that and is multiplied by traffic.
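Back-of-the-envelope version of that comparison. Only the "around a byte per metric per probe" figure comes from the comment above; the series count, scrape interval, request rates and the 500 bytes per event are assumptions picked for illustration, and event sizes after compression could be much smaller.

```python
SECONDS_PER_DAY = 86_400


def metrics_bytes_per_day(series, scrape_interval_s, bytes_per_sample):
    """Metrics cost: fixed number of samples per series, regardless of traffic."""
    samples_per_series = SECONDS_PER_DAY / scrape_interval_s
    return series * samples_per_series * bytes_per_sample


def events_bytes_per_day(requests_per_s, bytes_per_event):
    """Event cost: one record per request, so it scales with traffic."""
    return requests_per_s * SECONDS_PER_DAY * bytes_per_event


# Assumed: 10k series scraped every 15s at ~1 byte per sample.
metrics = metrics_bytes_per_day(series=10_000, scrape_interval_s=15, bytes_per_sample=1.0)
# Assumed: 500 bytes per stored event at two different traffic levels.
events_low = events_bytes_per_day(requests_per_s=100, bytes_per_event=500)
events_high = events_bytes_per_day(requests_per_s=10_000, bytes_per_event=500)

print(f"metrics:           {metrics / 1e6:.1f} MB/day")      # 57.6
print(f"events @ 100 rps:  {events_low / 1e6:.1f} MB/day")   # 4320.0
print(f"events @ 10k rps:  {events_high / 1e6:.1f} MB/day")  # 432000.0
```

The metrics line stays flat when traffic grows 100x; the events line grows with it.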

isburmistrov 844 days ago

> Those are entirely different orders of magnitude both when it comes to size and how much usefulness you lose. In modern storage backends like VictoriaMetrics a counter is gonna cost you around a byte per metric per probe. And as you emit them periodically, that is essentially independent of incoming traffic.

I thought this argument was about whether wide events can be used for metrics, or whether metrics are a completely different concept. If we want to emulate metrics in events, we would also emit them periodically, independently of the traffic, i.e. once in a while. Pretty much like Prometheus scraping works.
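A minimal sketch of that idea, assuming in-process counters that are flushed as one wide event per interval instead of one event per request. The class name, field names and the `emit` callback are hypothetical; a real setup would call `flush` from a timer.

```python
import json
import time
from collections import Counter
from typing import Callable, Optional


class PeriodicCounters:
    """Cumulative in-process counters, emitted as a single wide event on flush."""

    def __init__(self, emit: Callable[[dict], None], service: str):
        self._emit = emit
        self._service = service
        self._counts = Counter()

    def inc(self, name: str, by: int = 1) -> None:
        # Called on every request; nothing is emitted here.
        self._counts[name] += by

    def flush(self, now: Optional[float] = None) -> None:
        # Called once per interval; counters are cumulative and never reset,
        # similar in spirit to a scraped counter.
        event = {"ts": time.time() if now is None else now,
                 "service": self._service,
                 **self._counts}
        self._emit(event)


emitted = []
counters = PeriodicCounters(emit=emitted.append, service="checkout")
for _ in range(3):
    counters.inc("http_requests_total")
counters.inc("http_errors_total")
counters.flush(now=1_700_000_000.0)
print(json.dumps(emitted[0]))
```

The number of emitted events depends only on how often `flush` runs, not on how many requests came in.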

hagen1778 843 days ago

Storing telemetry efficiently is only part of what monitoring is supposed to do. The other part is querying: ad-hoc queries, dashboards, alerting queries executed every 15s or so. For querying to work fast, there has to be an efficient index, or multiple indexes depending on the query. Since you referred to ClickHouse as efficient columnar storage, please see what makes it different from a time series database - https://altinity.com/wp-content/uploads/2021/11/How-ClickHou...

isburmistrov 843 days ago

And yet people use ClickHouse quite effectively for this very problem, see the comment here: https://news.ycombinator.com/item?id=39549218

There are also time-series databases out there that are OK with high cardinality: https://questdb.io/blog/2021/06/16/high-cardinality-time-ser...

hagen1778 843 days ago

> And yet people use ClickHouse quite effectively for this very problem

There is no doubt that ClickHouse is a super-fast database. No one stops you from using it for this very problem. My point is that specialized time series databases will outperform ClickHouse.

> There are also time-series databases out there that are OK with high cardinality

So does this blog say that tolerance to cardinality means that QuestDB indexes only one of the columns in the data generated by this benchmark?

TSDBs like Prometheus, VictoriaMetrics or InfluxDB will perform filtering by any of the labels with equal speed, because this is how their index works. Their users don't need to think about the schema or about which column should be present in the filter.

But in ClickHouse and, apparently, in QuestDB, you need to specify a column or list of columns for indexing (the fewer columns, the better). If the user's query doesn't contain the indexed column in the filter - the query performance will be poor (full scan).
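To make the difference concrete, here is a simplified sketch: an inverted index from (label, value) pairs to series ids answers a filter on any label with a lookup plus set intersection, while a store without an index on the filtered column has to examine every row. This is a toy model with invented data, not how any of the databases named here are actually implemented.

```python
from collections import defaultdict

# Hypothetical series, identified by small integer ids.
SERIES = {
    0: {"job": "api", "instance": "a", "status": "200"},
    1: {"job": "api", "instance": "b", "status": "500"},
    2: {"job": "web", "instance": "a", "status": "200"},
}


def build_inverted_index(series):
    """Map every (label, value) pair to the set of series ids carrying it."""
    index = defaultdict(set)
    for sid, labels in series.items():
        for name, value in labels.items():
            index[(name, value)].add(sid)
    return index


def select(index, matchers):
    """Answer equality matchers on any labels by intersecting posting sets."""
    postings = [index.get(pair, set()) for pair in matchers.items()]
    if not postings:
        return set()
    return set.intersection(*postings)


def scan(series, matchers):
    """No usable index: check every series, and report how many were examined."""
    examined = 0
    hits = set()
    for sid, labels in series.items():
        examined += 1
        if all(labels.get(k) == v for k, v in matchers.items()):
            hits.add(sid)
    return hits, examined


INDEX = build_inverted_index(SERIES)
print(sorted(select(INDEX, {"instance": "a"})))                # [0, 2]
print(sorted(select(INDEX, {"job": "api", "status": "500"})))  # [1]
hits, examined = scan(SERIES, {"instance": "a"})
print(sorted(hits), examined)                                  # [0, 2] 3
```

Both paths return the same ids; the scan just touches all rows to get there, which is what hurts at scale.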

See how this happened in another benchmarketing blog post from QuestDB - https://telegra.ph/No-QuestDB-is-not-Faster-than-ClickHouse-...

flaminHotSpeedo 844 days ago

The whole point of wide events is recording an arbitrary set of key value pairs. How do you propose storing that in a columnar datastore?

phillipcarter 844 days ago

I can't speak for others, but at Honeycomb that's what we do. There are some details in this blog post that might be interesting: https://www.honeycomb.io/blog/why-observability-requires-dis...
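For readers wondering what this can look like in principle, here is one generic way to lay out arbitrary key-value events in columns: one column per key, created the first time the key is seen, with nulls where an event lacks that key. This is only an illustrative sketch with invented events, and it is not a description of Honeycomb's implementation.

```python
def to_columns(events):
    """Turn a list of dict events into {key: column}, padding gaps with None."""
    columns = {}
    for row, event in enumerate(events):
        # A key seen for the first time gets a column backfilled with None
        # for all earlier rows.
        for key in event:
            if key not in columns:
                columns[key] = [None] * row
        # Every known column gets exactly one entry per event.
        # Note: an explicit None value is indistinguishable from a missing key.
        for key, column in columns.items():
            column.append(event.get(key))
    return columns


# Hypothetical wide events with different sets of keys.
events = [
    {"status": 200, "path": "/a"},
    {"status": 500, "error": "timeout"},
]
print(to_columns(events))
# {'status': [200, 500], 'path': ['/a', None], 'error': [None, 'timeout']}
```

Sparse columns like `error` are mostly nulls, which is exactly the kind of data that compresses well when stored together.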