

by aprdm 2160 days ago

> Datadog has incredible integration with clouds (AWS and other), databases (postgresql, mysql, cassandra) and middleware (haproxy, kafka). It can capture all the metrics from all these out of the box with minimal effort, whereas you have to crawl hundreds of broken plugins for prometheus to get one third of that.

Interesting, I haven't had this experience. I monitor the DBs and middleware you mention, and the OSS plugins + OSS grafana boards worked pretty much out of the box. For what it's worth, we have around 20 different technologies for DB and middleware.

We aren't using cloud since we have our own datacenters, so there could be a big difference in usage.

As for prometheus not scaling, I don't know that I agree. We have more than 5k hosts on it currently and it is working fine. We do use some strategies like recording rules and federation, which are well documented.


user5994461 2160 days ago

The question would be: which plugins exactly? There are tons of plugins spread over GitHub, more or less working, and they're constantly shifting. I've had maybe 15 integrations working perfectly on datadog in 2 weeks of work. The same on prometheus/grafana could easily have taken 6 months (with a few exporters to write from scratch).

The cloud does make a difference. Just seeing the daily S3 usage per bucket was life changing. Immediately found that backups were not expiring after a while as they should, costing more and more money. ^^

Do you know how many metrics you are ingesting in prometheus? The storage size? And how many tags per host? We were reaching 1 TB of memory usage (mmap) on our server with 1500 hosts. Prometheus was literally grinding to a halt or crashing; I was forced to cut down some metrics and stick to the absolute minimum of tags.

Try prometheus_tsdb_head_series and prometheus_tsdb_storage_blocks_bytes, or run the du command on the data directory.
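
For reference, both metrics named above can be read through Prometheus's HTTP query endpoint (`/api/v1/query`). Below is a minimal Python sketch of that; the server address `http://localhost:9090` and the helper names are assumptions for illustration, not something from the thread.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of the Prometheus server; adjust to your setup.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urllib.parse.urlencode({'query': expr})}"


def first_sample_value(payload: dict) -> float:
    """Extract the first sample value from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    result = payload["data"]["result"]
    if not result:
        raise ValueError("query returned no samples")
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return float(result[0]["value"][1])


def query_value(expr: str, base_url: str = PROMETHEUS_URL) -> float:
    """Run an instant query against a live server and return the first value."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=10) as resp:
        return first_sample_value(json.load(resp))
```

With a reachable server, `query_value("prometheus_tsdb_head_series")` would return the current number of in-memory series, and `query_value("prometheus_tsdb_storage_blocks_bytes")` the bytes used by persisted blocks.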


aprdm 2160 days ago

Interesting, I have a very small team that maintains a lot of infrastructure including Prometheus. Setting up the monitoring for 5 datacenters, thousands of hosts and hundreds of services was maybe a 2-week effort. We then organically add and remove things, but that is maybe 4h/week of effort.

It gives us a lot of value. We were using other solutions before (sensu/nagios) and it is night and day.

All the plugins that we use are listed on the official Prometheus website: https://prometheus.io/docs/instrumenting/exporters/. We haven't looked outside this page.

Yeah, this kind of visibility is really lacking on-prem. Storage is hard: at scale, it's difficult to see what is using what.

We have around 500 GB of disk dedicated to the prometheus server in each datacenter, with a retention policy of 1 month. The VMs have around 16 GB of RAM.

I would say 95% of the hosts only have node exporter. Then 5% of the hosts will have redis/elasticsearch/postgres/mysql/etc exporters.

Also probably around 5% of the hosts have some sort of custom metrics. We have been leveraging the file exporter for writing custom service metrics.

Custom service metrics go through a merge request process where we work out the best way to structure them to avoid high cardinality. We also use recording rules to pre-compute things that would be expensive to query (think CPU usage across a datacenter).
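
Assuming the "file exporter" mentioned above is the node exporter's textfile collector, a service publishes custom metrics by writing a `*.prom` file in the Prometheus text format into the directory given to `--collector.textfile.directory`. A minimal Python sketch of that follows; the metric name, label and file name are hypothetical.

```python
import os
import tempfile


def render_metric(name: str, value: float, labels: dict, help_text: str,
                  metric_type: str = "gauge") -> str:
    """Render one metric in the Prometheus text exposition format.

    Assumes label values contain no quotes, backslashes or newlines.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    series = f"{name}{{{label_str}}}" if label_str else name
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{series} {value}\n"
    )


def write_prom_file(directory: str, filename: str, body: str) -> str:
    """Write the file atomically so the collector never sees a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates the file as 0600; make it readable by the exporter's user.
    os.chmod(tmp_path, 0o644)
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)  # atomic rename within the same directory
    return final_path
```

Keeping the label set small and fixed, as in the single `queue` label used in a call like `render_metric("myservice_jobs_queued", 42, {"queue": "render"}, "Jobs waiting in the queue.")`, is one way to keep cardinality under control.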

1 TB of RAM usage sounds insane. It looks like we can horizontally scale the prometheus servers and use some documented features for that.

For our use case it has been a very smooth ride so far!


user5994461 2160 days ago

You're lucky to have a team. I was doing that with another person in our spare time and had 0 hours/week to spend. I only give us a couple of weeks to set up new software and tune everything perfectly, then it has to be done for good. I will check maybe once a quarter for capacity adjustment or a software upgrade. (I am also responsible for a logging system ingesting 1 TB/day with no supervision.)

I should probably say that my experience with datadog goes back as far as 5 years ago. I already had monitoring working perfectly back then, when prometheus didn't exist, let alone the exporter plugins! So prometheus is really late and subpar to me. ^^

Looks like we had a similar amount of data in prometheus as you (1 TB for 60 days) but with 40% of the hosts. Maybe you have many small VMs? We had physical hosts with quad CPU (per CPU metrics) and network interfaces and stuff (I couldn't tune the node exporter to ignore disabled interfaces and some useless devices). Check how many distinct timeseries you have with prometheus_tsdb_head_series.

Datadog had amazing support for custom metrics (but watch out for extra billing and cardinality!). Applications can just send metrics to localhost:1234 where the agent is listening, and they're enriched automatically with host information and environment. Magic.
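
The "send to localhost" model described above can be sketched with a plain UDP socket and the StatsD-style datagram that the Datadog agent's DogStatsD listener accepts (`name:value|type|#tag:value,...`). The port `1234` is just the placeholder from the comment (DogStatsD listens on 8125 by default), and the metric and tag names are made up.

```python
import socket
from typing import Optional


def format_datagram(name: str, value: float, metric_type: str = "c",
                    tags: Optional[dict] = None) -> bytes:
    """Format one metric as a DogStatsD-style datagram ("c" means counter)."""
    line = f"{name}:{value}|{metric_type}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line.encode("utf-8")


def send_metric(datagram: bytes, host: str = "127.0.0.1", port: int = 1234) -> None:
    """Fire-and-forget over UDP: the application never blocks on the agent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, (host, port))
```

An application would call something like `send_metric(format_datagram("checkout.requests", 1, "c", {"env": "prod"}))`; the host-level enrichment mentioned above happens in the agent, not in this code.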

This reminds me, prometheus is broken with its idea of pulling metrics, when metrics should be pushed instead. Applications and hosts have to push metrics when they come online, it's not the responsibility of the metrics storage to know about every goddamn thing running in the company and try to talk to them (can't cross firewall anyway). Prometheus worked okay enough for the last company that was on premise with fixed hosts (weeks or months to move anything physical), but it's de facto broken for the previous company that was on AWS with instances created intraday.


aprdm 2160 days ago

Number of series: 158990 is displayed in the web UI.

We disabled a lot of useless metrics in the node exporter. I think pulling works OK if you have a service discovery mechanism. We hook Prometheus to Consul.

We have a mix of very small VMs and very beefy bare metal.


user5994461 2160 days ago

Oh right that's not much, only 3000 metrics per host. I guess we see more scaling problems because we have more metrics and tags (physical hosts with 128 CPU, imagine the per CPU metrics just to begin with). I spent days as well testing all the collectors and tuning things carefully.
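
To make the per-CPU point concrete: the node exporter's `node_cpu_seconds_total` carries a `cpu` and a `mode` label, so its series count per host is roughly the product of the two. The sketch below assumes the 8 modes usually reported on Linux; the exact list can vary by platform and exporter version, and the 4-CPU host is only there for comparison.

```python
from math import prod

# Assumed list of CPU modes; check your own exporter's output.
CPU_MODES = ["idle", "iowait", "irq", "nice", "softirq", "steal", "system", "user"]


def series_count(label_values: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


def cpu_series_per_host(cpu_count: int) -> int:
    """Series produced by one per-CPU, per-mode metric on a single host."""
    cpus = [str(i) for i in range(cpu_count)]
    return series_count({"cpu": cpus, "mode": CPU_MODES})


small_host = cpu_series_per_host(4)    # 4 * 8 = 32 series
big_host = cpu_series_per_host(128)    # 128 * 8 = 1024 series
print(small_host, big_host)
```

Every extra label multiplies this again, which is why trimming tags and disabling collectors has such a large effect on big hosts.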

That's all very interesting.


aprdm 2160 days ago

Sorry, I just picked one datacenter, so that's 1/5th of the hosts, or 15k metrics per host with some rudimentary math!
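
For anyone wanting to redo the back-of-the-envelope math, here is a small sketch that takes the figures quoted in this thread at face value (158990 series in one datacenter, roughly 5k hosts across 5 datacenters) and assumes the hosts are spread evenly, which the thread does not confirm.

```python
series_in_one_datacenter = 158_990  # "Number of series" quoted above
total_hosts = 5_000                 # "more than 5k hosts"
datacenters = 5

# Assumption: hosts are spread evenly across datacenters.
hosts_per_datacenter = total_hosts / datacenters
series_per_host = series_in_one_datacenter / hosts_per_datacenter

print(round(series_per_host))  # 159
```

Under those assumptions the result is on the order of hundreds of series per host rather than thousands, so the per-host figures in this exchange are best read as rough estimates; checking prometheus_tsdb_head_series against the actual host count in each datacenter would settle it.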

Our data lives in each datacenter only, and we query across datacenters via grafana when we need to.
