by user5994461
2159 days ago
The question would be: which plugins exactly? Because there are tons of plugins spread over GitHub, more or less working, and they're constantly shifting. I've had maybe 15 integrations working perfectly on Datadog in 2 weeks of work. The same on Prometheus/Grafana could easily have taken 6 months (with a few exporters to write from scratch).
The cloud does make a difference. Just seeing the daily S3 usage per bucket was life changing. Immediately found that backups were not expiring after a while as they should, costing more and more money. ^^
Do you know how many metrics you are ingesting in Prometheus? Storage size? And how many tags per host? We were reaching 1 TB of memory usage (mmap) on our server with 1500 hosts. Prometheus was literally grinding to a halt or crashing; I was forced to cut down some metrics and stick to the absolute minimum of tags. Try prometheus_tsdb_head_series and prometheus_tsdb_storage_blocks_bytes, or the du command on the data directory.
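For anyone who wants to check those two metrics without opening the web UI, here is a minimal sketch that reads them through the Prometheus HTTP query endpoint (`/api/v1/query`). The base URL `http://localhost:9090` and the helper names are assumptions for illustration, and it presumes the server scrapes its own metrics.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def first_value(payload):
    """Return the first sample of an instant-query response as a float.

    Returns None when the query matched no series.
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        return None
    # Each vector sample is [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def fetch(base_url, expr, timeout=10.0):
    """Run one instant query against a live server (needs network access)."""
    with urlopen(query_url(base_url, expr), timeout=timeout) as resp:
        return first_value(json.load(resp))


if __name__ == "__main__":
    base = "http://localhost:9090"  # assumption: a local Prometheus server
    for metric in ("prometheus_tsdb_head_series",
                   "prometheus_tsdb_storage_blocks_bytes"):
        print(metric, fetch(base, metric))
```

The head series count is the number to watch for memory pressure; the blocks size only covers persisted blocks, so `du` on the data directory will usually report a bit more.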
It gives us a lot of value. We were using other solutions before (sensu/nagios) and it is night and day.
All the plugins that we use are listed on the official Prometheus website: https://prometheus.io/docs/instrumenting/exporters/. We haven't looked outside this page.
--
The cloud does make a difference. Just seeing the daily S3 usage per bucket was life changing. Immediately found that backups were not expiring after a while as they should, costing more and more money. ^^
Yeah, this kind of visibility is really lacking on-prem. With storage at scale, it is hard to see what is using what.
--
Do you know how many metrics you are ingesting in prometheus? storage size? and how many tags per host? We were reaching 1 TB of memory usage (mmap) on our server with 1500 hosts. Prometheus was literally grinding to a halt or crashing, was forced to cut down some metrics and stick to the absolute minimum tags.
We have around 500 GB of disk dedicated to the Prometheus server per datacenter, with a retention policy of 1 month. The VMs have around 16 GB of RAM.
I would say 95% of the hosts only have node exporter. Then 5% of the hosts will have redis/elasticsearch/postgres/mysql/etc exporters.
Also probably around 5% of the hosts have some sort of custom metrics. We have been leveraging the file exporter for writing custom service metrics.
Custom service metrics go through a merge request process where we work out the best way to structure them to avoid high cardinality. We also use recording rules to pre-compute things that would be expensive to query (think CPU usage across a datacenter).
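To make the cardinality concern concrete, here is a back-of-the-envelope sketch of the kind of estimate such a review might do. It assumes the worst case where every combination of label values actually occurs, so it is an upper bound; the label names and counts are made up, apart from the 1500 hosts mentioned earlier in the thread.

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on time series for one metric name.

    label_value_counts maps each label name to the number of distinct
    values it can take. Assumes all combinations occur (worst case).
    """
    return prod(label_value_counts.values())


# Hypothetical metric with three labels.
proposed = {"host": 1500, "endpoint": 50, "status": 5}
print(max_series(proposed))  # upper bound with all three labels

# Dropping one label divides the bound by that label's value count.
trimmed = {k: v for k, v in proposed.items() if k != "status"}
print(max_series(trimmed))
```

Real series counts are usually lower than this bound, but a label with unbounded values (user IDs, request paths) makes the product blow up quickly, which is what the review is there to catch.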
1 TB of RAM usage sounds insane. It looks like we can horizontally scale the prometheus servers and use some documented features for that.
For our use case it has been a very smooth ride so far!