Hacker News
by aprdm 2159 days ago
Can you expand? As someone who maintains a large-ish prometheus/grafana installation on-prem, I don't know what we're missing! We have a couple of custom metrics that we developed in prometheus and the OSS plugins/dashboards look great.

From my perspective at a smaller startup (so YMMV), the first thing I noticed after switching to Datadog was how _fast_ it was. The queries and graphing capabilities were also really powerful. Or maybe I just didn't know how to use the old tools, but regardless it was super easy to pick up and do things I struggled with in Prometheus/Grafana.

It was also mind-blowing how things were integrated. For example: see a slow request? Click into the APM trace. Notice a service on that trace being slow? Click on it and see what host it was running on. From there, another button pulls up all the Docker containers running on that host at that point in time. The CPU usage is visualized - and, aha! We forgot to set a CPU limit on one of those other jobs.

Debugging issues like that would've been nearly impossible otherwise, and we had more than a few cases of that.

Yeah that kind of integration seems neat. We use ELK + Prometheus and it does require having Kibana + prometheus open OR building a dashboard in grafana pulling from both sources.

As far as speed goes, I haven't had that issue with prometheus. We use recording rules for things that benefit from being pre-computed.

I imagine the UX is quite different with a commercial product.

Datadog has incredible integration with clouds (AWS and other), databases (postgresql, mysql, cassandra) and middleware (haproxy, kafka). It can capture all the metrics from all these out of the box with minimal effort, whereas you have to crawl hundreds of broken plugins for prometheus to get one third of that.

If you're using clouds (AWS/Azure/Google), Datadog can capture all the AWS metadata automatically and merge it with existing metrics, so you can use instance tags and such for searching and filtering. It can also capture AWS metrics like ELB and S3 usage, which are hard to get otherwise.

So you simply get all the metrics you need and get them easily (I appreciate that people who haven't worked with these probably can't fathom what they are missing out on). There are default charts/dashboards that are quite good and available out of the box, whereas grafana is empty out of the box and you're once again forced to crawl for dashboard plugins.

Last but not least, the capabilities to search and visualize in datadog are incredible: you can draw any metric or combination of metrics in different ways and analyze usage. Prometheus can't chart shit. Grafana has limited charting and you're forced to create a dashboard to make one chart, which can't be done if you don't have admin permissions.

By the way, prometheus doesn't scale. It can reach 1000 or 2000 hosts tops and that's the end of it. I've operated it at the limit: some operations get really slow and we had to cut down on tags and some metrics to avoid crashing.

> Datadog has incredible integration with clouds (AWS and other), databases (postgresql, mysql, cassandra) and middleware (haproxy, kafka). It can capture all the metrics from all these out of the box with minimal effort, whereas you have to crawl hundreds of broken plugins for prometheus to get one third of that.

Interesting, I haven't had this experience. I monitor the DBs and middleware you mention, and the OSS plugins + OSS grafana boards worked pretty much out of the box. For what it's worth, we have around 20 different technologies for DB and middleware.

We aren't using cloud since we have our own datacenters so there could be a big difference in usage.

As for prometheus not scaling, I don't know that I agree. We have more than 5k hosts currently on it and it is working fine. We do use some strategies like recording rules and federation, which are well documented.

The question would be which plugins exactly? Because there are tons of plugins spread over GitHub, more or less working, and they're constantly shifting. I've had maybe 15 integrations working perfectly on datadog in 2 weeks of work. The same on prometheus/grafana could have taken 6 months easily (with a few exporters to write from scratch).

The cloud does make a difference. Just seeing the daily S3 usage per bucket was life changing. Immediately found that backups were not expiring after a while as they should, costing more and more money. ^^

Do you know how many metrics you are ingesting in prometheus? storage size? and how many tags per host? We were reaching 1 TB of memory usage (mmap) on our server with 1500 hosts. Prometheus was literally grinding to a halt or crashing, was forced to cut down some metrics and stick to the absolute minimum tags.

Try the prometheus_tsdb_head_series and prometheus_tsdb_storage_blocks_bytes metrics, or run du on the data directory.
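
For anyone who wants to script that check rather than type the metric names into the web UI, here is a minimal sketch in Python against the Prometheus HTTP API instant-query endpoint (/api/v1/query). The server address and the helper names are assumptions made up for illustration, not something from this thread.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a Prometheus server reachable here (9090 is the default port).
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_first_value(payload):
    """Return the first sample value of an instant-query response, or None."""
    if payload.get("status") != "success":
        return None
    results = payload.get("data", {}).get("result", [])
    if not results:
        return None
    # A vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return float(results[0]["value"][1])


def fetch_metric(expr, base_url=PROMETHEUS_URL, timeout=5.0):
    """Run an instant query against a live server and return the first value."""
    url = build_query_url(base_url, expr)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_first_value(json.load(resp))
```

Calling fetch_metric("prometheus_tsdb_head_series") against a running server should return the current number of in-memory series, and fetch_metric("prometheus_tsdb_storage_blocks_bytes") the size of the persisted blocks.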

Interesting, I have a very small team that maintains a lot of infrastructure including Prometheus. Setting up the monitoring for 5 datacenters, thousands of hosts and hundreds of services was maybe a ~2 week effort. We then organically add and remove things but it is maybe a ~4h/week effort.

It gives us a lot of value. We were using other solutions before (sensu/nagios) and it is night and day.

All the plugins that we use are listed on the official Prometheus website: https://prometheus.io/docs/instrumenting/exporters/ We haven't looked outside this page.

--

> The cloud does make a difference. Just seeing the daily S3 usage per bucket was life changing. Immediately found that backups were not expiring after a while as they should, costing more and more money. ^^

Yeah, this kind of visibility is really lacking on-prem. With storage at scale, it is hard to see what is using what.

--

> Do you know how many metrics you are ingesting in prometheus? storage size? and how many tags per host? We were reaching 1 TB of memory usage (mmap) on our server with 1500 hosts. Prometheus was literally grinding to a halt or crashing, was forced to cut down some metrics and stick to the absolute minimum tags.

We have around 500 GB of disk dedicated to the prometheus server in each datacenter, with a retention policy of 1 month. The VMs have around 16 GB of RAM.

I would say 95% of the hosts only have node exporter. Then 5% of the hosts will have redis/elasticsearch/postgres/mysql/etc exporters.

Also probably around 5% of the hosts have some sort of custom metrics. We have been leveraging the file exporter for writing custom service metrics.
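
A rough sketch of what writing such a metrics file can look like, assuming "file exporter" here means the node exporter's textfile collector, which reads *.prom files from a configured directory. The metric name, label and function names below are hypothetical.

```python
import os
import tempfile


def render_metric(name, value, labels=None, help_text="", metric_type="gauge"):
    """Render one metric in the Prometheus text exposition format.

    Label values are not escaped here, so keep quotes and backslashes out of them.
    """
    label_str = ""
    if labels:
        pairs = ",".join('{}="{}"'.format(k, v) for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines = []
    if help_text:
        lines.append("# HELP {} {}".format(name, help_text))
    lines.append("# TYPE {} {}".format(name, metric_type))
    lines.append("{}{} {}".format(name, label_str, value))
    return "\n".join(lines) + "\n"


def write_prom_file(directory, filename, content):
    """Write to a temp file, then rename, so a half-written file is never read."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file as 0600; the exporter may run as another user.
    os.chmod(tmp_path, 0o644)
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

A cron job or the service itself can call write_prom_file on every run, pointing it at whatever directory the collector is configured to read. The rename is what keeps a scrape from seeing a partial file.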

Custom service metrics go through a merge request process where we look for the best way to structure them to avoid high cardinality. We also use recording rules to pre-compute things that would be expensive to query (think CPU usage across a datacenter).
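
To make the cardinality concern concrete: a metric can produce at most one series per combination of label values, so the worst case is the product of the distinct values of each label. A small sketch of that arithmetic follows; the label names and counts are invented for illustration, not taken from anyone's setup.

```python
from math import prod


def worst_case_series(label_values):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical labels for a single request-counter metric.
labels = {
    "host": ["host-{}".format(i) for i in range(100)],
    "endpoint": ["endpoint-{}".format(i) for i in range(20)],
    "status": ["200", "400", "404", "500", "503"],
}
bounded = worst_case_series(labels)

# Adding an unbounded label such as a user ID multiplies everything again.
labels_with_user = dict(labels, user_id=[str(i) for i in range(1000)])
exploded = worst_case_series(labels_with_user)
```

With these made-up numbers the bound goes from 10,000 series to 10,000,000 for a single metric, which is the kind of thing a review step is meant to catch.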

1 TB of RAM usage sounds insane. It looks like we can horizontally scale the prometheus servers and use some documented features for that.

For our use case it has been a very smooth ride so far!

You're lucky to have a team. I was doing that with another person in our spare time and have 0 hours/week to spend. We only give ourselves a couple of weeks to set up new software and tune everything perfectly, then it has to be done for good. I will check maybe once a quarter for capacity adjustment or a software upgrade. (I also run a logging system ingesting 1 TB/day with no supervision.)

I should probably say that my experience with datadog goes back as far as 5 years ago. Already had monitoring working perfectly back then, when prometheus didn't exist, let alone the exporter plugins! So prometheus is really late and subpar to me. ^^

Looks like we've got a similar amount of data in prometheus as you (1 TB for 60 days) but with 40% of the hosts. Maybe you have many small VMs? We've got physical hosts with four CPUs (per-CPU metrics) and network interfaces and stuff (couldn't tune the node exporter to ignore disabled interfaces and some useless devices). Check how many distinct time series you have with prometheus_tsdb_head_series.

Datadog had amazing support for custom metrics (but watch out for extra billing and cardinality!). Applications can just send metrics to localhost:1234 where the agent is listening, and they're enriched automatically with host information and environment. Magic.
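
A minimal sketch of what "just send metrics to localhost" can look like, assuming the agent speaks the StatsD-style datagram format used by DogStatsD (name:value|type|#tags) over UDP. Port 1234 is taken from the comment above; the DogStatsD default is 8125. The metric names and helper functions are hypothetical.

```python
import socket

# Assumption: a local agent listening for StatsD-style datagrams on this UDP port.
AGENT_ADDR = ("127.0.0.1", 1234)


def format_datagram(name, value, metric_type="g", tags=None):
    """Format one metric as 'name:value|type|#tag1,tag2' (tags are optional)."""
    payload = "{}:{}|{}".format(name, value, metric_type)
    if tags:
        payload += "|#" + ",".join(tags)
    return payload.encode("utf-8")


def send_metric(name, value, metric_type="g", tags=None, addr=AGENT_ADDR):
    """Fire-and-forget UDP send; the application does not wait for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_datagram(name, value, metric_type, tags), addr)
```

For example, send_metric("checkout.requests", 1, "c", ["service:web"]) would emit a counter increment. The application only knows about localhost, and the host and environment enrichment described above happens on the agent side.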

This reminds me, prometheus is broken with its idea of pulling metrics, when metrics should be pushed instead. Applications and hosts have to push metrics when they come online, it's not the responsibility of the metrics storage to know about every goddamn thing running in the company and try to talk to them (can't cross firewall anyway). Prometheus worked okay enough for the last company that was on premise with fixed hosts (weeks or months to move anything physical), but it's de facto broken for the previous company that was on AWS with instances created intraday.

The web UI displays the number of series as 158990.

We disabled a lot of useless metrics in the node exporter. I think pulling works OK if you have a service discovery mechanism. We hook Prometheus to Consul.

We have a mix of very small VMs and very beefy bare metal.