So, anyway, long preamble, but the specific answer is: you can think of Druid's limitations as stemming from its focus on the needs of analytics applications rather than generic data warehousing.

So, when you say "Not for joins": Druid has joins, and we do joins. Just the other day, I saw a 50m row dimension table joined against a multi-billion row fact table in a few hundred millis. I regularly see customers run various queries that leverage "joins", and I advocate for them to do so (I work for Imply, the commercial entity most steeped in Druid). But these are almost always in the context of powering an analytics application. I would not tend to recommend that someone use Druid to take multiple billion-row data sets and conduct a cartesian product to generate a trillion-row data set; that's not Druid's immediate sweet spot.

When you say "not for sql DISTINCT", this is also worthy of discussion. Almost every Druid deployment I'm aware of regularly does count distinct style calls. The thing with count distinct is that, depending on how it is computed, the query SLAs that we tend to support can become difficult to meet. With Druid, the naive implementation of count distinct will tend to fit inside tight latency bands for small cardinalities of, say, a few million. But as you get into the billions, the naive calculation of uniques requires deduplicating billions of values, and that requires shuffling a good chunk of data around. Given that we work with multi-tenant deployments a lot, it is very common for the "tail" to have low cardinalities and the "whales" to have very high cardinalities (note, when we say "high cardinality" in Druid land, the numbers we are imagining are closer to billions than millions). For the product that our customer would be delivering, they need a consistent, predictable experience for any size tenant, be they whale or tail. This is where approximate count distinct comes in and is very commonly leveraged. With an approximate count distinct, we are able to provide tight latency bands even around count distinct queries, so that's why you will see approximate count distinct talked about in the context of Druid.
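To make the naive-versus-approximate trade-off concrete, here is a toy HyperLogLog-style sketch in Python. It only illustrates the general technique and is not Druid's implementation; the class name `ToyHyperLogLog`, the SHA-1 based hash, and the precision `p=12` are all assumptions chosen for the example. The point to notice is that the state is a fixed 4096 small registers no matter how many values go in, and two sketches merge with an element-wise max, so partial results can be combined without moving the raw values around.

```python
import hashlib
import math


class ToyHyperLogLog:
    """Toy HyperLogLog sketch. Illustrative only, not Druid's implementation."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                     # number of hash bits used as register index
        self.m = 1 << p                # number of registers (4096 for p=12)
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        # Take a 64-bit hash from the first 8 bytes of SHA-1 (an assumption;
        # any well-mixed 64-bit hash would do for this sketch).
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "ToyHyperLogLog") -> "ToyHyperLogLog":
        # Merging is an element-wise max, so partial sketches combine cheaply.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = ToyHyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def exact_count_distinct(values) -> int:
    # The naive approach: memory grows with the number of distinct values.
    return len(set(values))
```

With `p=12` the typical relative error of a sketch like this is on the order of 1.04 / sqrt(4096), so around 1.6%, while the exact version has to hold (and, in a distributed setting, shuffle) every distinct value.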

When you say "limited for high variability columns": while I can understand where the previous two are coming from, this one actually just baffles me. I might not be understanding the words correctly, but I'm interpreting "high variability" as "high cardinality"; if that is wrong, please correct me. That said, I am unsure why this would be true, as one of Druid's core strengths is dealing with high cardinalities, be they in specific columns or the result of a combination of columns. As an example, I will use Druid's own metrics-emitting capabilities. On every query that Druid runs, Druid emits its own form of "span" data about the run of that query (we started this before OpenTelemetry and all of that stuff was really a thing, so let's not go down that rabbit hole). These are independent metrics about the timing of the query at each individual layer of the processing (i.e. across all nodes in the distributed system, as well as at various points of processing inside the different processes). This means that Druid generates a new queryId for every query that comes into it. Our most common way of gaining insight and visibility into these metrics is to flow them back into another Druid instance. Given a cluster that does a consistent 1000qps, this gives a single column with a cardinality of about 86 million per day. Add the fact that there are other dimensions, like which host the metric came from, and the total cardinality of this data stream quickly approaches billions a day. We support these use cases on a daily basis without thinking about them. Additionally, talking AdTech again, each auction targets a different user, and the user identifiers that you see in any given marketplace very quickly approach hundreds of millions or billions a day. Look at those across multiple years and you quickly get to total cardinalities in the trillions.
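The queryId arithmetic above is easy to check with a few lines of Python. The 1000qps figure comes from the example; the number of hosts or processes emitting metrics per query (`emitters_per_query=20` in the usage note) is a purely hypothetical value used to show how the combined cardinality reaches the billions, not a number from any real cluster.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_query_ids(qps: int) -> int:
    """Distinct queryIds per day, assuming one new queryId per query."""
    return qps * SECONDS_PER_DAY


def daily_combined_upper_bound(qps: int, emitters_per_query: int) -> int:
    """Upper bound on distinct (queryId, host) pairs per day.

    emitters_per_query is a hypothetical count of hosts/processes that
    emit a metric for each query.
    """
    return daily_query_ids(qps) * emitters_per_query
```

Here `daily_query_ids(1000)` gives 86,400,000, and with the hypothetical 20 emitters per query `daily_combined_upper_bound(1000, 20)` gives 1,728,000,000 distinct combinations per day.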