| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by cheddar 1429 days ago

I apparently wrote a novel that HackerNews claims is too long, so let me try to break it up into parts. Hopefully HN doesn't hate me for it...

As the guy who wrote the first lines of code in Druid, I can venture a description of the limitations as I tend to see them. Of course, all systems have limitations, they are built for a purpose. The purpose that Druid was built for was to power an analytics-oriented product. What's a data-oriented product and why am I starting to talk all meta when you are asking for very specific technical limitations? Because understanding the meta leads to a much stronger understanding of fit and purpose than talking about a specific technical blahty-blah.

So, an analytics-oriented product is a product that renders a screen for an end-user based on some data/analytics. The very first one of these that we powered was a Digital Advertising focused dashboard that provided visibility into impressions, clicks, conversions and revenue across marketplaces. Other examples are things like metering and billing that show usage and costs attributed all the way down to an individual request level. Yet other examples are product recommendations based on purchasing behaviors, or fraud analysis based on recent purchase history, or observability style use cases viewing the flow of events through a system.

The thing that all of these have in common is that they are products that tend to be: 1) Multi-tenant 2) Follow a similar general "pattern" of queries, with highly variable boolean filtering criteria and a consistent need to view the same data across a wide array of different dimensions 3) Have an SLA defined for the product experience that determines a budget for how fast queries must run

1 comments

cheddar 1429 days ago

Talking about each of these a bit more.

Multi-tenant: the data sources that we deal with often have a field in them that describes the end-customer tenant. Often there are other fields that describe sub-teams inside of those tenants. There are often even multiple different ways to describe tenancy off of the same data. For example, in the ad tech world, there are publishers (the websites and mobile apps that show you ads) and there are advertisers (the people who want to run the ad) and there are marketplaces (the people who run the auctions) and there are an umpteenth level of intermediaries between each of these sub-divisions. When you want to understand what's going on in a marketplace, you want to be able to cut and aggregate the data across any and all of these different axes at the same time.

Query pattern: when you define a product based on data, that product is generally not going to suddenly wake up tomorrow and do a brand new query you've never seen or heard of before. You have to actually add code to your product to make it do that query. The things that will happen brand new tomorrow is those queries that have been pre-defined will get new filters added to them (new values will always be showing up, new tenants, new stuff). Additionally, one set of users of your product might be looking at usage by tenant, another one might be looking at usage by team and yet another one might be looking at usage by geo region, this means that the dimensions looked at will differ. This is what I mean when I say that the shape is the same, but boolean logic and dimensions looked at changes.

Having an SLA is also important for any product experience. When you are painting a UI for someone, they just want the UI painted inside of their expectations and don't care about what it means technically to make that happen.

These 3 things are all meaningfully contrasted against, e.g. Data Warehouse style workloads. With a Data Warehouse, the primary users are the business analysts. Those business analysts tend to be looking for a SQL interface, it is their job to write SQL statements, to join those tables and dig ever deeper into the ever-deepening data warehouse or lake or whatever other metaphor people want to have. On top of Data Warehouses, there are various visualization layers that allow you to define dashboards to try to target a wider audience, but at the end of the day those run against a data warehouse and the data warehouse is fundamentally built with the persona of the data analyst in mind. This persona is generally always going to come up with an ever more complex query to answer whatever the newest flavor of the month question happens to be. As the person writes a more complex query, they are okay with it taking longer (of course, faster is always better, but for something that will be run once and never again, who cares if it takes a bit longer) and care much more that it is guaranteed to finish. I.e. they do not actually have a query SLA, highly variant query timings is totally fine and acceptable, instead they require that any query that they write must be executable.

link

cheddar 1429 days ago

So, anyway, long pre-amble, but the specific answer is: you can think of Druid's limitations as based in targeting the needs of analytics applications instead of generic data warehousing.

So, when you say "Not for joins", Druid has joins, we do joins, just the other day, I saw a 50m row dimension table joined against a multi-billion row fact table in a few hundred millis. I regularly see and advocate for customers (I work for Imply, the commercial entity most steeped in Druid) to do various queries that leverage "joins". But, these are almost always in the context of powering an analytics application, I would not tend to recommend that someone use Druid to take multiple billion row data sets and conduct a cartesian product to generate a trillion-row data set, that's not Druid's immediate sweet spot.

When you say "not for sql DISTINCT" this is also worthy of discussion. Almost every Druid deployment I'm aware of regularly does count distinct style calls. The thing with count distinct is that, depending on how it is computed, the query SLAs that we tend to be supporting can become difficult to meet. Even with Druid, the naive implementation of count distinct will tend to be able to fit inside of tight latency bands for small cardinalities of like a few million. But, as you get into the billions, the naive calculation of unique requires deduplicating billions of values and that requires shuffling a good chunk of data around. Given that we work with multi-tenant deployments a lot, it is very common for the "tail" to have low cardinalities and the "whales" to have very high cardinalities (note, when we say "high cardinality" in Druid land, the numbers we are imagining are closer to billions than millions). For the product that our customer would be delivering, they need a consistent, predictable experience for any size tenant, be they whale or tail. This is where approximate count distinct comes in and is very commonly leveraged. With an approximate count distinct, we are able to provide tight latency bands around even count distinct queries, so that's why you will see approximate count distinct talked about in the context of Druid.

When you say "limited for high variability columns", while I can understand where the previous two are coming from, this one actually just baffles me. I might be not understanding the words correctly, but I'm interpretting "high variability" as "high cardinality", if that is wrong, please correct me. That said, I am unsure why this would be true as one of Druid's core strength is in dealing with high cardinalities, be they specific columns or high cardinality because of a combination of columns. As an example, I will use Druid's own metrics-emitting capabilities. On every query that Druid runs, Druid emits its own form of "span" data (we started this before OpenTelemetry and all of that stuff was really a thing, so let's not go down that rabbit hole) about the run of that query, these are independent metrics about the timing of the query at each individual layer of the processing (i.e. across all nodes in the distributed system as well as various different points of processing inside of various of the different processes). This means that Druid generates a new queryId for every query that comes into it. Our most common form of gaining insight and visibility into these metrics is to flow them back into another Druid instance. Given a cluster that does a consistent 1000qps, this gives a single column that is a cardinality of 86 million per day. Add the fact that there are other dimensions like which host the metric came from, and the total cardinality of this data stream quickly approaches billions a day. We support these use cases without thinking about them on a daily basis. Additionally, talking AdTech again, each auction targets a different user, the user identifiers that you see in any given marketplace very quickly approaches hundreds of millions or billions a day. You look at those across multiple years and you quickly get to total cardinalities in the trillions.

link

cheddar 1429 days ago

Now, for a technical "limitation" that maybe hasn't been highlighted: Druid believes that all hardware is replaceable. As a rule, Druid does not build a strong affinity between one specific piece of hardware and one specific chunk of data, any such affinity is ephemeral and can change on a whim. Some people call this a "cloud native" architecture. Why is this a limitation? Because it limits one set of optimizations that some other databases take advantage of: ingesting data into a globally-meaningful partitioning scheme. Taking this back to the uniques discussion, if you, for example, want to compute uniques on userId, some databases will allow you to say "I have 5 nodes, put 20% of users on each node partitioned by userId" and then when you query, each of those 5 nodes can compute a local unique number and you can just add those numbers to get the total uniques. This "optimization" for uniques actually tightly couples your database to only have 5 nodes, if you suddenly need to scale up to 10 nodes, you must now re-shard your data across those 10 nodes, which can be tedious. The other problem with this optimization is that it doesn't actually help as much as people think it will, sure it gives you good answers for the userId, but there's often a deviceId field that indicates the device that a user is using. If you partition by userId, it's entirely possible that two different users are using the same device, in which case now your deviceId metrics cannot leverage the same optimization. Instead, in Druid, we have made the choice that the operational decision of scaling up should be equivalent to adding a server and walking away, everything else should just happen automatically. In order to enable this ease of scaling, we must assume that data can move to any server on a whim, which means that a forced, pinned global partitioning is out of the cards.

Summarizing, if you are wondering if Druid is a good fit for your use case, you can think about whether you are build an analytics application or not. Are you building a product or are you giving data to analysts? If you are building a product, then Druid should be a strong contender for helping accelerate your delivery of the product. If you are just trying to give arbitrary SQL access to some analysts, there's a lot of data warehouses out there that will do a great job.

One more thing, and I'll try not to be too commercial, but the internet is full of disinformation these days and Druid has been around long enough to build up a history of things that were maybe once true but are no more. One relatively fast way to evaluate fit for purpose would be to reach out to a commercial vendor (like Imply, remember, I am employed by Imply), discuss what you are trying to accomplish and see if it is a good fit. While vendors are absolutely biased towards trying to sell their software, vendors also hate sinking time and effort into trying to support deployments that are just a bad fit. Additionally, it's a bit easier to see a vendor's biases (and ask them to prove things) than it is to understand and know the biases and experiences of the people who posted stuff on the internets.

link