Hacker News
by cmckn 2143 days ago
Prometheus is great. I first heard about it at KubeCon last fall, and kind of shrugged it off as one of those fledgling "cloud native" projects that I probably didn't need or didn't have time to learn. There's actually a lot of adoption: you can find great exporters and Grafana dashboards for almost any OSS you're running today. I started collecting metrics from ZooKeeper and HBase in about an hour, having never had access to that telemetry before. From the existence of Cortex[1], it seems that Prometheus doesn't scale incredibly well, but I don't think many users will hit these limits.

[1] https://cortexmetrics.io/

3 comments

My Prometheus system is a $10/mo Linode. It collects from 27 other hosts, and at least 100 services distributed across those hosts - doesn't even break a sweat. All the exporters run through a wireguard VPN. Prometheus is great for a small/medium SaaS type environment.
What do you use as a frontend? As far as I could tell, Grafana's free tier doesn’t allow monitoring a cluster of servers.
I use Grafana and some custom ones. I have only one Prometheus box, so clustering is not a problem I'm having (and likely won't be; I can scale vertically a long way for my smallish operation).
You could self host it.
Can I self-host it for monitoring a cluster of servers? Currently I have Grafana installed on each of my servers and I have to monitor them individually. I want a centralised dashboard over Telegraf + InfluxDB.
Why would you install Grafana + Influx on each server instead of one central one?
I haven't spent much time on this, but most of the docs were for setting it up on each host. Is there a proper tutorial for clusters?

Also, I wanted to keep monitoring unaffected for the other servers if one of them goes down. If I set up a central server for monitoring, then that becomes a single point of failure.

Prometheus "scales" really well, but it does so via segmentation and federation, rather than by increasing the size of, e.g., a cluster. Some use cases don't fit that model, so projects like Cortex and Thanos exist.
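To make "segmentation" concrete, here is a minimal sketch of splitting scrape targets across several independent Prometheus servers by a stable hash. It only illustrates the idea; the function names, the target strings, and the choice of MD5 are assumptions for this example and are not claimed to match how Prometheus itself assigns targets.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target to a shard index using a stable hash.

    MD5 is used here only because it is stable across processes
    (unlike Python's built-in hash()); this is an assumption of the sketch.
    """
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % num_shards


def assign_targets(targets, num_shards):
    """Return {shard_index: [targets]} so each server scrapes only its share."""
    shards = {i: [] for i in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards


if __name__ == "__main__":
    # Hypothetical host:port targets, purely for illustration.
    hosts = [f"host{i}.example.internal:9100" for i in range(27)]
    for index, members in assign_targets(hosts, 3).items():
        print(index, len(members))
```

Each server then holds only its own slice of the series, and a separate layer (federation, or something like Cortex or Thanos) is what provides a combined view.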
Not vertically, at least. The memory usage for indexing has room for improvement. If I read the pprofs correctly, every scrape interval and every remote write allocates huge amounts of memory, which is only cleaned up on garbage collection. You can easily need >64 GB of RAM for tens of thousands of time series; otherwise you OOM.
The biggest single Prometheus server I have access to currently uses almost 64GiB of RAM and ingests about 80000 samples per second. Most scrape intervals are 60s. It holds about 5 000 000 time series. Note that we do have more time series: the above server is just a horizontal shard, ingesting just one part of the total metrics volume.
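As a sanity check on those figures, a quick back-of-the-envelope calculation, assuming every series is scraped once per 60s interval and treating the whole 64GiB as if it were spent on series (it is not, so the per-series number is only a rough upper bound). The function names are made up for this sketch.

```python
def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Each active series yields one sample per scrape interval."""
    return active_series / scrape_interval_s


def bytes_per_series(total_ram_bytes: int, active_series: int) -> float:
    """Rough upper bound: pretends all RAM is attributable to series."""
    return total_ram_bytes / active_series


if __name__ == "__main__":
    rate = samples_per_second(5_000_000, 60)
    print(round(rate))  # 83333, in line with the quoted "about 80000"

    per_series = bytes_per_series(64 * 2**30, 5_000_000)
    print(f"{per_series / 1024:.1f} KiB per series at most")
```

By the same arithmetic, "tens of thousands" of series at a 60s interval would be only a few hundred to a couple of thousand samples per second.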
Tens of thousands seems rather low, we are running 3 million series with less than 32GB of RAM and still have room to spare.
I do 15 million on about 64GB average memory. Have you tried recently?
This has not been my experience at all. I'd file that as a bug.
Interesting to say that Prometheus is "fledgling". The project is almost 7 years old and the Google thing on which it is based is ~15 years old.
> I first heard about it at KubeCon last fall, and kind of shrugged it off as one of those fledgling...

I didn’t know the age of the project, because I hadn’t heard of it. That’s why I go on to say that in actuality it has a ton of adoption and I’ve had a great experience with it.