

by dijit 2143 days ago

Grafana truly is best in class, but I have strong reservations about Prometheus.

I really want to like it. It’s just so _easy_: publish a little webpage with your metrics and Prometheus takes care of the rest. Lovely.

But I often find that the cardinality of the data is substantially lower than even the defaults of alternatives (influxdb has 1s and even Zabbix has 5s).

Not to mention the lost writes (missing data points) which have no logged explanation.

All of this, however, was in my homelab, which, while unconstrained in resources, lacks a lot of the fit and finish of a prod system.

I also take pause with the architecture; it’s not meant to scale. It’s written on the tin so it’s not like I’m picking fault, but when you’re building a dashboard that sucks in data from 25 different Prometheus data sources, it becomes difficult to run functions like SUM(), because the keys may be out of sync causing some really ugly and inaccurate representations of data.
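To illustrate the out-of-sync problem: a minimal sketch, assuming each source is just a list of (timestamp, value) pairs, of snapping samples to a shared time grid before summing. The function name `align_and_sum` and the 15-second step are made up for the example and are not anything Prometheus or Grafana provides.

```python
from collections import defaultdict

def align_and_sum(sources, step):
    """Sum samples from several sources after snapping timestamps to a shared grid.

    sources: list of lists of (timestamp_seconds, value) pairs, one list per source.
    step: bucket width in seconds.
    Only buckets where every source reported are returned, so a missing
    sample shows up as a gap rather than as an artificially low sum.
    """
    totals = defaultdict(float)
    seen = defaultdict(set)
    for idx, samples in enumerate(sources):
        for ts, value in samples:
            bucket = int(ts // step) * step
            if idx in seen[bucket]:
                continue  # keep only the first sample per source per bucket
            seen[bucket].add(idx)
            totals[bucket] += value
    return {b: totals[b] for b in sorted(totals) if len(seen[b]) == len(sources)}

# Two sources scraped at slightly different offsets; the second is missing a point.
a = [(0.2, 1.0), (15.1, 2.0), (30.3, 3.0)]
b = [(4.9, 10.0), (19.8, 20.0)]
aligned = align_and_sum([a, b], step=15)
print(aligned)  # {0: 11.0, 15: 22.0}
```

Dropping incomplete buckets is one possible choice; filling gaps with the last known value is another, with different failure modes.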

Everything about the design (polling, single database) tells me that it was designed primarily to sit alongside something small. It could never handle the tens of millions of data points per second that I ingest(ed) at my (now previous) job.

But it has a lot of hype, and maybe I’m holding it wrong.

6 comments

ecnahc515 2143 days ago

> Everything about the design (polling, single database) tells me that it was designed primarily to sit alongside something small.

Prometheus is designed to be "functionally sharded". You shouldn't be running one "mega prometheus". Often it's something like 1 Prometheus per-team, depending on the amount of metrics each produces.

You can use federation at lower resolutions, or one of the distributed setups (Thanos/Cortex) if you want to avoid the extra storage or lower resolution that federation entails.

> But I often find that the cardinality of the data is substantially lower than even the defaults of alternatives

Not to distract, but I think you meant resolution, not cardinality. Cardinality is the metadata like labels/dimensions. Resolution is the granularity in the time.
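A rough way to see the difference in numbers, as a sketch with made-up labels: cardinality is how many distinct series a metric can produce, resolution is how many points each of those series gets over time. The product below is an upper bound that assumes every label combination actually occurs.

```python
from itertools import product

def series_count(label_values):
    """Cardinality: number of distinct label combinations for one metric."""
    return len(list(product(*label_values.values())))

def samples_per_day(scrape_interval_seconds):
    """Resolution: how many points each series gets per day."""
    return 86400 // scrape_interval_seconds

# Hypothetical labels on a single request counter.
labels = {
    "method": ["GET", "POST"],
    "status": ["200", "404", "500"],
    "instance": ["a", "b", "c", "d"],
}
cardinality = series_count(labels)
print(cardinality)          # 24 series
print(samples_per_day(15))  # 5760 points per series per day
print(samples_per_day(1))   # 86400 points per series per day
```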

AlphaSite 2143 days ago

Doesn’t it also have cardinality issues?

halfmatthalfcat 2143 days ago

Once you get high enough (hitting a metric with >100k unique labels), queries become unmanageable and incredibly slow when backed by the stock datastore (TSDB). However, there are backing datastores (TimescaleDB, InfluxDB, VictoriaMetrics, etc.) that ingest Prometheus metrics and enable higher cardinality.

unixhero 2143 days ago

This is where I fall off.

Is Prometheus a DB that (can) forward data to another DB?

halfmatthalfcat 2143 days ago

Prometheus is a data format but it's also a "suite" of tools on top of that data format.

Usually what happens is your app, db, whatever will expose metrics (http request status, average response time, etc) in the Prometheus format which is then scraped by the Prometheus ingestor. The ingestor stores those metrics in a (short-term) datastore called TSDB. Prometheus also ships with a little web UI as well that can query those metrics in TSDB.
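For a feel of what "the Prometheus format" looks like on the wire, here is a minimal sketch that renders one metric in the text exposition format using only the standard library. The metric name and label values are invented, and the helper `render_metric` is hypothetical; real apps would normally use an official client library, which also handles escaping.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    Note: this sketch does not escape special characters in label values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests handled.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(body)
```

Serving that text over HTTP at a path the scraper is configured to poll is essentially the whole contract on the app side.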

However, Prometheus allows scraping from the ingestor (usually via federation) or pushing into an external datastore that is usually more performant than TSDB.

So when people say "Prometheus", they are usually talking about the suite of tools; practically, however, it's really just the format of the metrics data.

unixhero 2142 days ago

Thanks for the explanation

ecnahc515 2142 days ago

Sort of. https://prometheus.io/docs/prometheus/latest/storage/#remote...

roskilli 2143 days ago

And M3DB too if you want to cluster and scale out, vs cloud store

RhodesianHunter 2143 days ago

There's no real way to develop a metrics system without cardinality issues. Where you draw the line depends on the backing database but they're all fairly constrained.

ashtonkem 2143 days ago

Honestly, it’s incredibly impressive how far you can push them as is. We send a lot of data into these systems.

Legogris 2143 days ago

VictoriaMetrics[0] is API-compatible with Prometheus but is also a horizontally scalable, distributed and persistent time-series database (cf. InfluxDB). Together with vmagent it essentially becomes an (almost) drop-in HA replacement for Prometheus.

[0]: https://github.com/VictoriaMetrics/VictoriaMetrics

dewey 2143 days ago

These might be a good read if you are considering it:

- https://www.robustperception.io/evaluating-performance-and-c...

- https://medium.com/@valyala/evaluating-performance-and-corre...

ashtonkem 2143 days ago

My understanding is that Prometheus is designed for you to deploy multiple instances within your company, rather than deploying a limited number of instances for the company or division. So I would reasonably run a Prometheus instance by myself or with my neighboring teams rather than depending on a centralized instance run by $OPS.

scaryclam 2143 days ago

This is how we use it and it works well. Other teams are also free to use whatever else they want and if we need an "overview" it's pretty easy to upstream certain metrics elsewhere (say, a centralised system run by ops) to collate together.

Being able to also control which metrics are important to my team vs the wider team is a BIG bonus of this sort of decentralised system.

ashtonkem 2143 days ago

As one of my directs pointed out, it also reduces the "blast radius" for any mistakes around metrics. If I mess up and send orders of magnitude too many metrics to Prometheus, the worst case is that I'll lose my own metrics since it's only my own instance. The pull nature of Prometheus also helps here. But with something like Graphite, I can accidentally overload the StatsD relays and ruin everyone's metrics, which is bad.

EdwardDiego 2143 days ago

We've used Thanos to aggregate multiple Prometheus (Promethii?) across our clusters to enable us to scale; each Prometheus deals only with a subset of scrape targets.

Biggest issue I've had was an app that was accidentally publishing several thousand metrics which caused the default scrape timeout of 15s to kick in.

(It was publishing Kafka lag per consumer group per topic, which was fine and dandy, until someone released an app that runs about 500 instances at peak, and scaled up and down frequently, and had incorporated the pod id into the consumer group names, which led to Kafka tracking many, many, many consumer groups. Given that the consumers were low value anyway, we now just exclude them from having their lag tracked.)
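A back-of-the-envelope sketch of how that blows up, with invented names: the group names, the `report-worker-` prefix, and the two topics are all hypothetical, and the prefix filter just stands in for whatever exclusion mechanism the exporter offers.

```python
def lag_series(groups, topics):
    """Distinct (consumer group, topic) pairs, i.e. lag series to export."""
    return {(g, t) for g in groups for t in topics}

def is_tracked(group, excluded_prefixes):
    """True unless the group name starts with one of the excluded prefixes."""
    return not any(group.startswith(p) for p in excluded_prefixes)

topics = ["orders", "payments"]
stable_groups = ["billing", "search-indexer"]
# 500 pods, each registering its own consumer group because the pod id is in the name.
per_pod_groups = [f"report-worker-{i:05d}" for i in range(500)]
all_groups = stable_groups + per_pod_groups

before = lag_series(all_groups, topics)
after = lag_series(
    [g for g in all_groups if is_tracked(g, ["report-worker-"])], topics
)
print(len(before), len(after))  # 1004 4
```

Pods that scale up and down keep adding new group names over time, so the real count grows beyond the peak instance count.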

fnord123 2143 days ago

>Promethii

Prometheuses.

ii is for Latin words. Prometheus is/was Greek. I guess you could use Prometheoí but it would quickly derail any conversation. :)

rollulus 2143 days ago

There was a talk at PromCon 2016 about this subject [1]. The conclusion was: in English, indeed, Prometheuses. In Ancient Greek: Prometheis.

[1]: https://www.youtube.com/watch?v=B_CDeYrqxjQ

unixhero 2143 days ago

I kind of like the Ancient Greek version.

skohan 2143 days ago

> Grafana truly is best in class

Really? Recently we've been playing with Chronograf with InfluxDB and most people find it a lot nicer to work with than Grafana (specifically because it makes discoverability a lot nicer)

_xrjp 2143 days ago

For our modest cloud infra, InfluxData TICK (InfluxDB, Kapacitor, Chronograf and Telegraf) stack has fitted exactly with our needs. We really like its folding building-blocks, interoperability and yeah... easy discoverability and configuration. But also its very convenient InfluxQL which lets us customize reports with ease on InfluxDB.

VectorLock 2143 days ago

The Explore interface in recent Grafana versions is much nicer.

jtl999 2143 days ago

> Not to mention the lost writes (missing data points) which have no logged explanation.

FWIW I've had similar issues with MySQL backed Zabbix before.