The tool does indeed do a very good job from a developer's point of view, but when you look at Datadog end to end, the feeling changes. When I was working as a DevOps engineer, I remember how pissed off our Head of Infrastructure was with Datadog's shady licensing and pricing model. The lack of detailed, itemized billing and of proper access control (which team can use which feature, who can publish which metrics, etc.) makes the tool a pain in the long run. We even started looking for affordable alternatives.
Yeah, amazing, incredible tool, but the billing is obfuscated to the point you almost wonder if they're intentionally trying to make it impossible to understand. Though once you realize there are tons of hacks to abuse their billing methods and pay much less, it becomes apparent that it's just a case of Hanlon's razor.
Yep, the access controls are limited. You can't finely control who can raise alerts to where, for example, so it's not "enterprise" in a sense, because enterprise is all about tightly controlling what the peons can view and do.
In my experience alerting often came down to some departments trying to push some shit to some other departments. I personally avoid working in monitoring/alerting for that reason: it's just human problems and dysfunctional organizations, nothing any tool can help with.
It's crazy how easy it was to set up and cover all our infra (AWS, ELB, postgresql, cassandra, kafka, haproxy, nginx, etc.). The tool paid for itself with the infra optimizations we could find in the first month of usage.
It makes me sad when I'm forced to work with graphite/prometheus/grafana at my new company. These can't gather half the metrics and their charting capabilities are so bad in comparison.
Can you expand? As someone who maintains a large-ish prometheus/grafana installation on-prem, I don't know what we're missing! We have a couple of custom metrics that we developed in prometheus, and the OSS plugins/dashboards look great.
From my perspective at a smaller startup (so YMMV), the first thing I noticed after switching to Datadog was how _fast_ it was. The queries and graphing capabilities were also really powerful. Maybe I just didn't know how to use the old tools, but regardless, it was super easy to pick up and do things I struggled with in Prometheus/Grafana.
It was also mind-blowing how things were integrated. For example: see a slow request? Click into the APM trace. Notice a service on that trace being slow? Click on it and see what host it was running on. From there, another button pulls up all the Docker containers running on the host at that point in time. The CPU usage is visualized and, aha! We forgot to set a CPU limit on one of those other jobs.
Debugging issues like that would've been nearly impossible otherwise, and we had more than a few cases of that.
Yeah that kind of integration seems neat. We use ELK + Prometheus and it does require having Kibana + prometheus open OR building a dashboard in grafana pulling from both sources.
As for speed, I haven't had that issue with prometheus. We use recording rules for things that benefit from being pre-computed.
I imagine the UX is quite different when using a product.
Datadog has incredible integration with clouds (AWS and others), databases (postgresql, mysql, cassandra) and middleware (haproxy, kafka). It can capture all the metrics from all these out of the box with minimal effort, whereas you have to crawl through hundreds of broken plugins for prometheus to get one third of that.
If you're using a cloud (AWS/Azure/Google), Datadog can capture all the AWS metadata automatically and merge it with existing metrics, so you can use instance tags and such for searching and filtering. It can also capture AWS metrics like ELB and S3 usage, which are hard to get otherwise.
So you simply get all the metrics you need and get them easily (I appreciate that people who haven't worked with these probably can't fathom what they are missing out on). There are default charts/dashboards that are quite good and available out of the box, whereas grafana is empty out of the box and you're once again forced to crawl for dashboard plugins.
Last but not least, the capabilities to search and visualize in datadog are incredible: you can draw any metric or combination of metrics in different ways and analyze usage. Prometheus can't chart shit. Grafana has limited charting and you're forced to create a dashboard to make one chart, which you can't do if you don't have admin permissions.
By the way, prometheus doesn't scale. It can reach 1000 or 2000 hosts tops and that's the end of it. I've operated it at the limit: some operations get really slow, and we had to cut down on tags and some metrics to avoid crashing.
> Datadog has incredible integration with clouds (AWS and others), databases (postgresql, mysql, cassandra) and middleware (haproxy, kafka). It can capture all the metrics from all these out of the box with minimal effort, whereas you have to crawl through hundreds of broken plugins for prometheus to get one third of that.
Interesting, I haven't had this experience. I monitor the DBs and middleware you mention, and the OSS plugins + OSS grafana boards worked pretty much out of the box. For what it's worth, we have around 20 different technologies for DB and middleware.
We aren't using cloud since we have our own datacenters, so there could be a big difference in usage.
As for prometheus not scaling, I don't know that I agree. We currently have more than 5k hosts on it and it is working fine. We do use some strategies like recording rules and federation, which are well documented.
The question would be which plugins exactly? Because there are tons of plugins spread over GitHub, more or less working, and they're constantly shifting. I've had maybe 15 integrations working perfectly on datadog in 2 weeks of work. The same on prometheus/grafana could have taken 6 months easily (with a few exporters to write from scratch).
The cloud does make a difference. Just seeing the daily S3 usage per bucket was life changing. Immediately found that backups were not expiring after a while as they should, costing more and more money. ^^
Do you know how many metrics you are ingesting in prometheus? storage size? and how many tags per host? We were reaching 1 TB of memory usage (mmap) on our server with 1500 hosts. Prometheus was literally grinding to a halt or crashing, was forced to cut down some metrics and stick to the absolute minimum tags.
Try prometheus_tsdb_head_series and prometheus_tsdb_storage_blocks_bytes, or run the du command on the data directory.
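If it helps, here is a rough sketch of pulling those two metrics through the Prometheus HTTP API instead of the UI. The server address is an assumption (a local instance on the default port), and the helper names are made up for illustration; only the `/api/v1/query` endpoint and the two metric names come from Prometheus itself.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a Prometheus server reachable at this address; adjust as needed.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_values(payload: dict) -> list[float]:
    """Pull the sample values out of an instant-query (vector) response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


def query(promql: str, base_url: str = PROMETHEUS_URL) -> list[float]:
    """Run an instant query against the server and return the sample values."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return extract_values(json.load(resp))


if __name__ == "__main__":
    # Number of series currently held in the in-memory head block.
    print("active series:", query("prometheus_tsdb_head_series"))
    # Bytes used by persisted blocks on disk.
    print("block bytes:", query("prometheus_tsdb_storage_blocks_bytes"))
```

The head series count is usually the number to watch when memory is the problem, since it tracks how many distinct series the server is holding at once.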
Interesting, I have a very small team that maintains a lot of infrastructure including Prometheus. Setting up the monitoring for 5 datacenters, thousands of hosts and hundreds of services was maybe a ~2 week effort. We then organically add and remove things, but it is maybe a ~4h/week effort.
It gives us a lot of value. We were using other solutions before (sensu/nagios) and it is night and day.
> The cloud does make a difference. Just seeing the daily S3 usage per bucket was life changing. Immediately found that backups were not expiring after a while as they should, costing more and more money. ^^
Yeah, this kind of visibility is really lacking on-prem. With storage, it is hard to see at scale what is using what.
--
> Do you know how many metrics you are ingesting in prometheus? storage size? and how many tags per host? We were reaching 1 TB of memory usage (mmap) on our server with 1500 hosts. Prometheus was literally grinding to a halt or crashing, was forced to cut down some metrics and stick to the absolute minimum tags.
We have around 500 GB of disk dedicated to each datacenter's prometheus server, with a retention policy of 1 month. The VMs have around 16 GB of RAM.
I would say 95% of the hosts only have node exporter. Then 5% of the hosts will have redis/elasticsearch/postgres/mysql/etc exporters.
Also probably around 5% of the hosts have some sort of custom metrics. We have been leveraging the file exporter for writing custom service metrics.
Custom service metrics go through a merge request process where we work out the best way to structure them to avoid high cardinality. We also use recording rules to pre-compute things that would be expensive to query (think CPU usage across a datacenter).
1 TB of RAM usage sounds insane. It looks like we can horizontally scale the prometheus servers and use some documented features for that.
For our use case it has been a very smooth ride so far!
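For anyone wanting to sanity-check numbers like the 500 GB / 1 month figure above, here is a back-of-the-envelope sketch based on the sizing formula in the Prometheus storage docs (disk = retention seconds × samples per second × bytes per sample). The 2 bytes per sample is an assumption taken from the upper end of the 1-2 bytes the docs quote, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def needed_disk_bytes(retention_days: float, samples_per_second: float,
                      bytes_per_sample: float = 2.0) -> float:
    """Estimate disk needed: retention time * ingestion rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def sustainable_samples_per_second(disk_bytes: float, retention_days: float,
                                   bytes_per_sample: float = 2.0) -> float:
    """Invert the formula: the ingestion rate a given disk budget can hold."""
    return disk_bytes / (retention_days * SECONDS_PER_DAY * bytes_per_sample)


if __name__ == "__main__":
    # 500 GB of disk with 30 days of retention, as in the setup described above.
    rate = sustainable_samples_per_second(500e9, 30)
    print(f"roughly {rate:,.0f} samples/s fit in that budget")
```

This only covers disk. Memory usage is driven mostly by the number of active series, which is why cutting tags helps so much.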
+1 for Datadog. I can't comprehend how they scale to accommodate the data we send them, let alone everyone else. Literally tens of thousands of data points per second, around the clock, for years. Thousands and thousands of unique keys. And it's all queryable instantaneously. All for about 1/4 of a developer's salary.
Datadog is nice as a user. It's pretty terrible as a person who cares about the budget.
Their pricing is pretty ridiculous at times and their sales people are often way overaggressive. You have to pay extra for containers on a host. They also make it impossible to keep users from consuming additional paid features.
I like having Datadog when I need to debug, but I'm pretty sick of the dark patterns and surprise bills. I'll probably go with Prometheus in my next greenfield.
Agreed that the downside is billing and that their sales people are assholes. I learned to be insensitive to pushy sales when negotiating for software; it's the same shit with half the suppliers. The tool is really great though and worth paying for.
I dread having to go back to Prometheus because the company is too cheap to pay for proper tooling (datadog) and developers would rather write their own time series database for their resume.
Try Instana, it doesn't have that insane pricing model. It is also more useful, in that it doesn't focus on "graph porn" but instead on genuinely useful insights like automated alerts and root cause analysis.
-1 datadog. Unless you have an SRE on payroll, it's obtuse, expensive as hell and you can never tell what exactly you're paying for. I don't like it much. It took a LOT of man hours to set up.
Get on the phone with them. I set up regular calls and had them explain each and every charge and where it came from. They would even waive billing on a feature I wanted to try out so I could determine the baseline cost. I 100% agree the pricing is impossible to understand, but they will work with you to find what works from both an ops and a budget standpoint.
I have to object to the "lot of man hours". This was by far the easiest thing to set up compared to all the competitors and the open source solution we have. Maybe a week or two to set up everything on all our hosts and capture all the metrics for systems, AWS, databases and custom metrics.
Honeycomb has one trick that it does very well: high-dimensionality analysis. Other than that, Honeycomb doesn't do a lot. Their alerting is basic, their dashboards are basic, etc.
Datadog doesn't do that specific feature as well (it has alternatives), but it also has so many other features that all tie together very nicely: metrics, logging, events, very good dashboards, analysis notebooks, alerting, SLOs, performance monitoring, trace analysis, security monitoring. It's a really extensive product.
When it comes to general observability, I'm a strong believer that you need a wide range of different views – just logs aren't enough, just metrics aren't enough, etc. I've worked in a team trying to use Prometheus for everything and there was so much friction, whereas with Datadog there has always been a way to achieve something.
I think Honeycomb is a good feature that should be bought by a company like Datadog and integrated into a wider more mature feature set.
Sorry for the late reply. Dimensionality is just the number of dimensions a given record has.
For example, if you have a log line to represent an HTTP request completing it might have a server hostname that processed it, a path, and a request duration – 3 dimensions. High dimensionality is just having lots of dimensions.
Most monitoring systems aren't great at correlating between lots of dimensions, or cost a lot if you want high cardinality, which high dimensionality contributes to. The cardinality is how many possible combinations of values there are across all of the dimensions together (so if you have 3 paths and 2 servers, those two fields have a combined cardinality of 6, meaning 6 different places where you need to store your request duration).
When you start getting lots of dimensions and lots of values for each dimension, things start getting expensive and you have to be quite restrained about what you choose to monitor. Maybe you decide not to track duration per server because you hope your servers are roughly the same, whereas you know that URL paths perform differently much of the time.
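To make the arithmetic concrete, here is a tiny sketch of how that combined cardinality grows. The paths, server names and status codes are made-up values for illustration, and the result is an upper bound, since not every combination necessarily shows up in practice.

```python
from math import prod

# Hypothetical label values, mirroring the 3 paths x 2 servers example.
dimensions = {
    "path": ["/", "/login", "/checkout"],
    "server": ["web-1", "web-2"],
}


def max_cardinality(dims: dict[str, list[str]]) -> int:
    """Upper bound on distinct series: product of distinct values per dimension."""
    return prod(len(set(values)) for values in dims.values())


print(max_cardinality(dimensions))  # 3 paths * 2 servers = 6

# Adding one more dimension multiplies the total rather than adding to it.
with_status = {**dimensions, "status": ["200", "404", "500"]}
print(max_cardinality(with_status))  # 6 * 3 status codes = 18

# Dropping the per-server dimension, as described above, cuts it back down.
without_server = {k: v for k, v in dimensions.items() if k != "server"}
print(max_cardinality(without_server))  # 3
```

Swap in realistic numbers (hundreds of paths, thousands of hosts, a user ID) and it's easy to see why most tools make you choose.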
Honeycomb's great advantage is that they support this high dimensionality/cardinality really well. Not only do they not cost a lot to do it, but they also have a really nice UI for exploring this data and how different bits correlate together, without you having to know up-front what you want to look for. Honeycomb is expensive, but not prohibitively so for some engineering teams.
I'd still recommend Datadog _first_ because it does a lot more for a little less money, but if you're pushing the boundaries of what Datadog is capable of with tracing through distributed systems then Honeycomb enables the next step.
Yeah I heard they are pretty good but too bad they don't have mobile access for me. Do you have a need to access tools like Datadog on your mobile phone?