|
Talking about each of these a bit more. Multi-tenant: the data sources that we deal with often have a field in them that describes the end-customer tenant. Often there are other fields that describe sub-teams inside of those tenants. There are often even multiple different ways to describe tenancy off of the same data. For example, in the ad tech world, there are publishers (the websites and mobile apps that show you ads) and there are advertisers (the people who want to run the ad) and there are marketplaces (the people who run the auctions) and there are an umpteenth level of intermediaries between each of these sub-divisions. When you want to understand what's going on in a marketplace, you want to be able to cut and aggregate the data across any and all of these different axes at the same time. Query pattern: when you define a product based on data, that product is generally not going to suddenly wake up tomorrow and do a brand new query you've never seen or heard of before. You have to actually add code to your product to make it do that query. The things that will happen brand new tomorrow is those queries that have been pre-defined will get new filters added to them (new values will always be showing up, new tenants, new stuff). Additionally, one set of users of your product might be looking at usage by tenant, another one might be looking at usage by team and yet another one might be looking at usage by geo region, this means that the dimensions looked at will differ. This is what I mean when I say that the shape is the same, but boolean logic and dimensions looked at changes. Having an SLA is also important for any product experience. When you are painting a UI for someone, they just want the UI painted inside of their expectations and don't care about what it means technically to make that happen. These 3 things are all meaningfully contrasted against, e.g. Data Warehouse style workloads. With a Data Warehouse, the primary users are the business analysts. Those business analysts tend to be looking for a SQL interface, it is their job to write SQL statements, to join those tables and dig ever deeper into the ever-deepening data warehouse or lake or whatever other metaphor people want to have. On top of Data Warehouses, there are various visualization layers that allow you to define dashboards to try to target a wider audience, but at the end of the day those run against a data warehouse and the data warehouse is fundamentally built with the persona of the data analyst in mind. This persona is generally always going to come up with an ever more complex query to answer whatever the newest flavor of the month question happens to be. As the person writes a more complex query, they are okay with it taking longer (of course, faster is always better, but for something that will be run once and never again, who cares if it takes a bit longer) and care much more that it is guaranteed to finish. I.e. they do not actually have a query SLA, highly variant query timings is totally fine and acceptable, instead they require that any query that they write must be executable. |
So, when you say "Not for joins", Druid has joins, we do joins, just the other day, I saw a 50m row dimension table joined against a multi-billion row fact table in a few hundred millis. I regularly see and advocate for customers (I work for Imply, the commercial entity most steeped in Druid) to do various queries that leverage "joins". But, these are almost always in the context of powering an analytics application, I would not tend to recommend that someone use Druid to take multiple billion row data sets and conduct a cartesian product to generate a trillion-row data set, that's not Druid's immediate sweet spot.
When you say "not for sql DISTINCT" this is also worthy of discussion. Almost every Druid deployment I'm aware of regularly does count distinct style calls. The thing with count distinct is that, depending on how it is computed, the query SLAs that we tend to be supporting can become difficult to meet. Even with Druid, the naive implementation of count distinct will tend to be able to fit inside of tight latency bands for small cardinalities of like a few million. But, as you get into the billions, the naive calculation of unique requires deduplicating billions of values and that requires shuffling a good chunk of data around. Given that we work with multi-tenant deployments a lot, it is very common for the "tail" to have low cardinalities and the "whales" to have very high cardinalities (note, when we say "high cardinality" in Druid land, the numbers we are imagining are closer to billions than millions). For the product that our customer would be delivering, they need a consistent, predictable experience for any size tenant, be they whale or tail. This is where approximate count distinct comes in and is very commonly leveraged. With an approximate count distinct, we are able to provide tight latency bands around even count distinct queries, so that's why you will see approximate count distinct talked about in the context of Druid.
When you say "limited for high variability columns", while I can understand where the previous two are coming from, this one actually just baffles me. I might be not understanding the words correctly, but I'm interpretting "high variability" as "high cardinality", if that is wrong, please correct me. That said, I am unsure why this would be true as one of Druid's core strength is in dealing with high cardinalities, be they specific columns or high cardinality because of a combination of columns. As an example, I will use Druid's own metrics-emitting capabilities. On every query that Druid runs, Druid emits its own form of "span" data (we started this before OpenTelemetry and all of that stuff was really a thing, so let's not go down that rabbit hole) about the run of that query, these are independent metrics about the timing of the query at each individual layer of the processing (i.e. across all nodes in the distributed system as well as various different points of processing inside of various of the different processes). This means that Druid generates a new queryId for every query that comes into it. Our most common form of gaining insight and visibility into these metrics is to flow them back into another Druid instance. Given a cluster that does a consistent 1000qps, this gives a single column that is a cardinality of 86 million per day. Add the fact that there are other dimensions like which host the metric came from, and the total cardinality of this data stream quickly approaches billions a day. We support these use cases without thinking about them on a daily basis. Additionally, talking AdTech again, each auction targets a different user, the user identifiers that you see in any given marketplace very quickly approaches hundreds of millions or billions a day. You look at those across multiple years and you quickly get to total cardinalities in the trillions.