| > I'm only interested in time series databases for use by developers and operations to store and retrieve data that pertains to the health and performance of the services they build and operate. Everything in this blog will judge the entries based on their suitability for that task. That is a very particular problem, in which the data storage is a minimal [yet important] aspect of the full system. You're probably going the wrong route if you're trying to redesign your own, and you'll only realize that way too late, when you have to design your own metrics collection, own graphing, own alerting, own... The standard proven open-source stack: collectd/statsd (metrics collection) + whisper/graphite (storage) + grafana (cute graphs and dashboards). The latest fad is to replace graphite with prometheus (which is better in some aspects but has its own faults). Both these open source tools will satisfy your purpose. HARDCORE LIMITATIONS: Both these open source tools are entirely single node. There is no form of sharding, high availability or horizontal scaling. (Rule of thumb: should be fine up to 100 hosts and applications. Then get ready to throw big hardware at it and tune retention aggressively.) --- Some quick maths: 8 bytes per data point * one data point every 5 seconds = 967 kB per metric over a week. 967 kB per metric * 100 metrics per host * 100 hosts = ~10 GB per week for high precision. Any of these parameters can spiral tenfold (depending on the setup, retention, hosts, metrics per app...). That means going straight into TB range and scaling issues where one node is simply out of the question. --- It's pretty clear that the open source solutions don't scale and are hard to maintain... so what's next when we outgrow them? Switch to the latest generation of monitoring tools. The two best solutions are datadog and signalfx. They both accept custom metrics from your app. And... oh wait I just noticed that dataloop.io is a new SaaS solution trying to compete with them. Oops :D |
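The quick maths in the comment above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and defaults are made up for illustration, and it counts raw data points only, assuming no compression, indexes or replication.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def weekly_bytes(bytes_per_point=8, interval_s=5, metrics_per_host=100, hosts=100):
    """Return (bytes per metric, total bytes) for one week of raw data points.

    Hypothetical helper: ignores compression, indexes and any storage overhead.
    """
    points_per_metric = SECONDS_PER_WEEK // interval_s
    per_metric = points_per_metric * bytes_per_point
    total = per_metric * metrics_per_host * hosts
    return per_metric, total


per_metric, total = weekly_bytes()
print(f"{per_metric:,} bytes per metric per week")
print(f"{total / 1e9:.2f} GB per week for 100 hosts x 100 metrics")

# Any single parameter growing tenfold (here: hosts) multiplies the total by ten.
_, total_10x = weekly_bytes(hosts=1000)
print(f"{total_10x / 1e9:.2f} GB per week with 1000 hosts")
```

Changing `interval_s` or `metrics_per_host` instead of `hosts` scales the total the same way, which is how the estimate reaches the TB range.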
Build vs buy is an age-old discussion. You won't convince anyone to switch from one side to the other. There will continue to be people like you and me who would prefer to buy, and others who want to build and run it themselves. As you have found out, I don't need to be convinced: I started a company because there were no good options to buy at the time. In most cases, for monitoring microservices, I'd buy a SaaS solution. I founded Dataloop 3 years ago, so it's not really a new startup any more. We're past Series A and starting to grow.
It's true that we compete with Datadog and SignalFx in that area, although our real competition is open source, with 90% of the addressable market using older tools like Nagios etc. As the shift to the cloud and microservices happens, I'm sure it won't be a winner-takes-all market. Dataloop tends to focus on the enterprise end of the scale, whereas SignalFx is more developer focussed and Datadog is more operations and SME.
When you say best, I'd argue that's subjective. SignalFx charges by the metric and that gets very expensive. Datadog limits you to 100 metrics per node with an agent-based pricing model. Dataloop uses per-node pricing that's much cheaper, with unlimited metric volume. We're aiming to keep the costs extremely low by investing in highly efficient backend storage.
The reason people are moving away from Graphite to InfluxDB and Prometheus is the dimensional data model. Graphite simply isn't as powerful. Similarly, StatsD aggregates down to the service and doesn't help pinpoint the outlier. Prometheus collects all metrics in their raw format far more efficiently and will let you instantly drill down into what is causing the issue.
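As a rough illustration of why the dimensional model matters, here is a sketch in plain Python. It is not the Prometheus or InfluxDB API; the metric name, labels, hosts and values are all made up. The point is that when every series keeps its labels, the same data can be averaged per service or split per host to find the outlier.

```python
from collections import defaultdict

# Graphite-style: the dimensions are baked into one dotted path (hypothetical name).
graphite_name = "prod.checkout.web03.request_latency_ms"

# Dimensional style: one metric, each sample tagged with labels (made-up data).
samples = [
    ({"service": "checkout", "host": "web01"}, 120.0),
    ({"service": "checkout", "host": "web02"}, 118.0),
    ({"service": "checkout", "host": "web03"}, 950.0),
]


def average_by(samples, label):
    """Group samples by one label and return the mean value per group."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels[label]].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


# Aggregated down to the service, the slow host is hidden inside the average...
service_avg = average_by(samples, "service")

# ...but because the raw labelled samples were kept, we can drill down by host.
host_avg = average_by(samples, "host")
worst_host = max(host_avg, key=host_avg.get)

print(service_avg)
print(worst_host)
```

If only the per-service average had been stored, which is roughly what pre-aggregating does, the per-host breakdown would no longer be recoverable.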
To answer your question about what's next after you outgrow open source solutions that don't scale... well, that was kind of the point of the blog! DalmatinerDB scales to millions of metrics per second on a single node, and linearly as you add additional nodes. It isn't exactly hard to maintain either, as it's based on Riak Core.
I guess the final thing to say is that this wasn't really an advert for Dataloop. Our business model doesn't depend on selling database features. Unlike other SaaS companies, we're happy to release the work done on our time series database for free, as open source.
Why would we do that? Mostly because it's fun to do open source stuff. Also because hiring Erlang developers is pretty hard and this gives me an excuse to talk at conferences where they hang out.
We've had a team of people working on this stuff for over a year now and, as you've mentioned, no open source time series databases really scale. It's a problem we've solved, and we're giving the solution away for free. I must be really bad at conveying that message in the blog.