Hacker News
by xgbi 1051 days ago
Anyone with experience scaling Prometheus horizontally? We are reaching the limits of our instance, memory- and CPU-wise, and I have yet to choose between scaling it myself with sharding or using Thanos/VictoriaMetrics/Cortex.
3 comments

If you want to query across the whole data set, use one of the other things. Prometheus has a "federation" option but there's not been any active work on it for years. Querying across instances is basically the definition of Thanos: take a bunch of Prometheus servers and query across them. Plus long-term storage in S3.

VictoriaMetrics, Cortex and Mimir are centralised data stores that accept data from multiple Prometheus servers, but you could also run headless agents that just scrape and send the data.

Note: if you are on a version before 2.44, try upgrading. Prometheus has slimmed down a bit since then.

[I am a Prometheus and Mimir maintainer]
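
For the do-it-yourself sharding option raised in the question, the usual idea is to hash each scrape target and have every Prometheus instance keep only the targets that land in its bucket (Prometheus relabelling has a `hashmod` action for this). A minimal Python sketch of the idea follows; the function names and target addresses are made up for illustration, and the hash here is not guaranteed to match what Prometheus itself computes.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target to a shard index in [0, num_shards)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Interpret the low 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[8:], "big") % num_shards


def assign(targets, num_shards):
    """Group targets by the shard that would be responsible for scraping them."""
    shards = {i: [] for i in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards


if __name__ == "__main__":
    # Hypothetical target addresses.
    example = [f"10.0.0.{i}:9100" for i in range(1, 11)]
    for shard, members in assign(example, 3).items():
        print(shard, members)
```

Each target always maps to the same shard, so no coordination between instances is needed. The catch is the one mentioned above: no single instance can answer a query over the whole data set any more.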

I've been through this song and dance. Did months-long PoCs (with live data, running next to the then-production Prometheus deployment) of Thanos, Cortex and Victoria Metrics.

VM won hands down on pretty much all counts. It's easy and simple to operate and monitor, it scales really well, and you can plan around how you want to partition and scale each component. It's also incredibly cheap to run, as its performance is superior to the others, even when it was backed by spinning HDDs and the other solutions were on SSDs.

It's especially easy to operate on Kubernetes using their CRDs and operators.

I am not associated with Victoria Metrics in any way, just a happy user and sysadmin who ran it for a few years.

VictoriaMetrics was recommended to me by a contractor and I've been very happy with it as well. It does have an option to push in metrics, which I intend to use with transient environments like CI jobs and the like, though I haven't gotten there yet.

Yep, we used to use that in a few places: CI jobs, batch processes, etc. Prometheus has Pushgateway, which we also used before migrating to VM, but it had certain drawbacks (can't recall exactly what, sorry) that the new solution didn't.
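
To make the push option concrete, here is a rough Python sketch of a CI job rendering a sample in the Prometheus text exposition format and POSTing it. The URL assumes a single-node VictoriaMetrics on localhost with its default port and its Prometheus-format import endpoint; check that against your own setup. The metric and label names are invented.

```python
import urllib.request


def _escape(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample line, e.g. metric{a="b"} 1.0 (labels sorted by name)."""
    label_str = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}" if label_str else f"{name} {value}"


def push(lines, url="http://localhost:8428/api/v1/import/prometheus"):
    """POST rendered lines to the import endpoint (URL is an assumption)."""
    body = ("\n".join(lines) + "\n").encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical metric emitted at the end of a CI job.
    line = render_sample(
        "ci_job_duration_seconds", {"job": "build", "branch": "main"}, 12.5
    )
    print(line)
    # push([line])  # uncomment once the URL points at a real instance
```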
Yeah, whatever you do, don’t use Mimir.

Operational nightmare, expensive to run, various parts of the entirely-too-many moving pieces it contains broke all the time and the performance was…unimpressive.

I've heard that some people manage to run this thing successfully, and power to them, but I want nothing more to do with it.

Just save yourself the pain and use Victoria Metrics. Added benefit: you get an implementation of a rate function that’s actually correct.
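
For context on what a rate function has to do: a counter only goes up, except when the process restarts and it drops back to zero, so a per-second rate has to treat a drop as a reset rather than a negative increase. The sketch below shows only that part. It does not reproduce Prometheus's or VictoriaMetrics's implementation; as far as I know the two differ in how they treat the edges of the query window, which this ignores.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the given samples.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Counter reset: counting restarted from zero, so the whole
            # current value counts as new increase.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


if __name__ == "__main__":
    # Made-up samples with a reset between t=15 and t=30.
    print(simple_rate([(0, 100), (15, 130), (30, 10), (45, 40)]))
```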

I have been running Mimir reasonably well. When it comes to performance, what exactly did you find unimpressive? I'm interested to know what pitfalls or pain points you have encountered so far.