Hacker News new | ask | show | jobs
by cheddar 1428 days ago
So, anyway, long pre-amble, but the specific answer is: you can think of Druid's limitations as based in targeting the needs of analytics applications instead of generic data warehousing.

So, when you say "Not for joins", Druid has joins, we do joins, just the other day, I saw a 50m row dimension table joined against a multi-billion row fact table in a few hundred millis. I regularly see and advocate for customers (I work for Imply, the commercial entity most steeped in Druid) to do various queries that leverage "joins". But, these are almost always in the context of powering an analytics application, I would not tend to recommend that someone use Druid to take multiple billion row data sets and conduct a cartesian product to generate a trillion-row data set, that's not Druid's immediate sweet spot.

When you say "not for sql DISTINCT" this is also worthy of discussion. Almost every Druid deployment I'm aware of regularly does count distinct style calls. The thing with count distinct is that, depending on how it is computed, the query SLAs that we tend to be supporting can become difficult to meet. Even with Druid, the naive implementation of count distinct will tend to be able to fit inside of tight latency bands for small cardinalities of like a few million. But, as you get into the billions, the naive calculation of unique requires deduplicating billions of values and that requires shuffling a good chunk of data around. Given that we work with multi-tenant deployments a lot, it is very common for the "tail" to have low cardinalities and the "whales" to have very high cardinalities (note, when we say "high cardinality" in Druid land, the numbers we are imagining are closer to billions than millions). For the product that our customer would be delivering, they need a consistent, predictable experience for any size tenant, be they whale or tail. This is where approximate count distinct comes in and is very commonly leveraged. With an approximate count distinct, we are able to provide tight latency bands around even count distinct queries, so that's why you will see approximate count distinct talked about in the context of Druid.

When you say "limited for high variability columns", while I can understand where the previous two are coming from, this one actually just baffles me. I might be not understanding the words correctly, but I'm interpretting "high variability" as "high cardinality", if that is wrong, please correct me. That said, I am unsure why this would be true as one of Druid's core strength is in dealing with high cardinalities, be they specific columns or high cardinality because of a combination of columns. As an example, I will use Druid's own metrics-emitting capabilities. On every query that Druid runs, Druid emits its own form of "span" data (we started this before OpenTelemetry and all of that stuff was really a thing, so let's not go down that rabbit hole) about the run of that query, these are independent metrics about the timing of the query at each individual layer of the processing (i.e. across all nodes in the distributed system as well as various different points of processing inside of various of the different processes). This means that Druid generates a new queryId for every query that comes into it. Our most common form of gaining insight and visibility into these metrics is to flow them back into another Druid instance. Given a cluster that does a consistent 1000qps, this gives a single column that is a cardinality of 86 million per day. Add the fact that there are other dimensions like which host the metric came from, and the total cardinality of this data stream quickly approaches billions a day. We support these use cases without thinking about them on a daily basis. Additionally, talking AdTech again, each auction targets a different user, the user identifiers that you see in any given marketplace very quickly approaches hundreds of millions or billions a day. You look at those across multiple years and you quickly get to total cardinalities in the trillions.

1 comments

Now, for a technical "limitation" that maybe hasn't been highlighted: Druid believes that all hardware is replaceable. As a rule, Druid does not build a strong affinity between one specific piece of hardware and one specific chunk of data, any such affinity is ephemeral and can change on a whim. Some people call this a "cloud native" architecture. Why is this a limitation? Because it limits one set of optimizations that some other databases take advantage of: ingesting data into a globally-meaningful partitioning scheme. Taking this back to the uniques discussion, if you, for example, want to compute uniques on userId, some databases will allow you to say "I have 5 nodes, put 20% of users on each node partitioned by userId" and then when you query, each of those 5 nodes can compute a local unique number and you can just add those numbers to get the total uniques. This "optimization" for uniques actually tightly couples your database to only have 5 nodes, if you suddenly need to scale up to 10 nodes, you must now re-shard your data across those 10 nodes, which can be tedious. The other problem with this optimization is that it doesn't actually help as much as people think it will, sure it gives you good answers for the userId, but there's often a deviceId field that indicates the device that a user is using. If you partition by userId, it's entirely possible that two different users are using the same device, in which case now your deviceId metrics cannot leverage the same optimization. Instead, in Druid, we have made the choice that the operational decision of scaling up should be equivalent to adding a server and walking away, everything else should just happen automatically. In order to enable this ease of scaling, we must assume that data can move to any server on a whim, which means that a forced, pinned global partitioning is out of the cards.

Summarizing, if you are wondering if Druid is a good fit for your use case, you can think about whether you are build an analytics application or not. Are you building a product or are you giving data to analysts? If you are building a product, then Druid should be a strong contender for helping accelerate your delivery of the product. If you are just trying to give arbitrary SQL access to some analysts, there's a lot of data warehouses out there that will do a great job.

One more thing, and I'll try not to be too commercial, but the internet is full of disinformation these days and Druid has been around long enough to build up a history of things that were maybe once true but are no more. One relatively fast way to evaluate fit for purpose would be to reach out to a commercial vendor (like Imply, remember, I am employed by Imply), discuss what you are trying to accomplish and see if it is a good fit. While vendors are absolutely biased towards trying to sell their software, vendors also hate sinking time and effort into trying to support deployments that are just a bad fit. Additionally, it's a bit easier to see a vendor's biases (and ask them to prove things) than it is to understand and know the biases and experiences of the people who posted stuff on the internets.