
Now, for a technical "limitation" that maybe hasn't been highlighted: Druid believes that all hardware is replaceable. As a rule, Druid does not build a strong affinity between one specific piece of hardware and one specific chunk of data; any such affinity is ephemeral and can change on a whim. Some people call this a "cloud native" architecture. Why is this a limitation? Because it rules out one set of optimizations that some other databases take advantage of: ingesting data into a globally meaningful partitioning scheme.

Taking this back to the uniques discussion: if you want to compute uniques on userId, for example, some databases will allow you to say "I have 5 nodes, put 20% of users on each node, partitioned by userId." Then, when you query, each of those 5 nodes can compute a local unique count and you can just add those counts to get the total uniques. This "optimization" for uniques actually tightly couples your database to having exactly 5 nodes. If you suddenly need to scale up to 10 nodes, you must re-shard your data across those 10 nodes, which can be tedious. The other problem with this optimization is that it doesn't help as much as people think it will. Sure, it gives you good answers for userId, but there's often a deviceId field that indicates the device a user is using. If you partition by userId, it's entirely possible that two different users are using the same device and land on different nodes, in which case your deviceId metrics cannot leverage the same optimization.

Instead, in Druid, we have made the choice that the operational decision of scaling up should be equivalent to adding a server and walking away; everything else should just happen automatically. To enable this ease of scaling, we must assume that data can move to any server on a whim, which means that a forced, pinned global partitioning is not in the cards.
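To make the partitioned-uniques trade-off above concrete, here is a minimal Python sketch. It is not how Druid or any particular database implements anything; the event data, the toy hash, and the helper names (`partition_by_user`, `summed_local_uniques`, `exact_uniques`) are all made up for illustration. It only shows that summing per-node distinct counts is exact for the partitioning key (userId) and over-counts for another field (deviceId).

```python
from collections import defaultdict

# Hypothetical events as (userId, deviceId). Users u1 and u2 share device d1,
# and users u3 and u4 share device d3.
EVENTS = [
    ("u1", "d1"),
    ("u2", "d1"),
    ("u2", "d2"),
    ("u3", "d3"),
    ("u1", "d1"),
    ("u4", "d3"),
]

USER, DEVICE = 0, 1  # tuple positions of the two fields


def partition_by_user(events, num_nodes):
    """Pin every event to a node chosen from its userId only."""
    nodes = defaultdict(list)
    for event in events:
        # Toy deterministic hash (an assumption for this sketch); a real
        # system would use a proper hash function.
        node = sum(event[USER].encode()) % num_nodes
        nodes[node].append(event)
    return nodes


def summed_local_uniques(nodes, field):
    """Each node counts its own distinct values; the counts are then added."""
    return sum(len({e[field] for e in node_events}) for node_events in nodes.values())


def exact_uniques(events, field):
    """Ground truth: distinct values across the whole data set."""
    return len({e[field] for e in events})


nodes = partition_by_user(EVENTS, num_nodes=2)

user_sum = summed_local_uniques(nodes, USER)
user_exact = exact_uniques(EVENTS, USER)
device_sum = summed_local_uniques(nodes, DEVICE)
device_exact = exact_uniques(EVENTS, DEVICE)

print("userId   summed:", user_sum, "exact:", user_exact)
print("deviceId summed:", device_sum, "exact:", device_exact)
```

With this toy data the userId sums match the exact count (4 and 4), because no user appears on more than one node. The deviceId sum comes out as 5 against an exact count of 3, because shared devices show up on both nodes and get counted twice.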

Summarizing, if you are wondering whether Druid is a good fit for your use case, think about whether or not you are building an analytics application. Are you building a product, or are you giving data to analysts? If you are building a product, then Druid should be a strong contender for helping accelerate your delivery of the product. If you are just trying to give arbitrary SQL access to some analysts, there are a lot of data warehouses out there that will do a great job.

One more thing, and I'll try not to be too commercial, but the internet is full of disinformation these days and Druid has been around long enough to build up a history of things that were maybe once true but are no more. One relatively fast way to evaluate fit for purpose would be to reach out to a commercial vendor (like Imply; remember, I am employed by Imply), discuss what you are trying to accomplish, and see if it is a good fit. While vendors are absolutely biased towards trying to sell their software, vendors also hate sinking time and effort into trying to support deployments that are just a bad fit. Additionally, it's a bit easier to see a vendor's biases (and ask them to prove things) than it is to understand and know the biases and experiences of the people who posted stuff on the internets.