by cheddar
1429 days ago
I apparently wrote a novel that HackerNews claims is too long, so let me try to break it up into parts. Hopefully HN doesn't hate me for it...

As the guy who wrote the first lines of code in Druid, I can venture a description of the limitations as I tend to see them. Of course, all systems have limitations, because they are built for a purpose. The purpose that Druid was built for was to power an analytics-oriented product. What's an analytics-oriented product, and why am I starting to talk all meta when you are asking for very specific technical limitations? Because understanding the meta leads to a much stronger understanding of fit and purpose than talking about a specific technical blahty-blah.

So, an analytics-oriented product is a product that renders a screen for an end-user based on some data/analytics. The very first one of these that we powered was a Digital Advertising focused dashboard that provided visibility into impressions, clicks, conversions and revenue across marketplaces. Other examples are things like metering and billing that show usage and costs attributed all the way down to an individual request level. Yet other examples are product recommendations based on purchasing behaviors, fraud analysis based on recent purchase history, or observability-style use cases viewing the flow of events through a system. The thing that all of these have in common is that they are products that tend to be:
1) Multi-tenant
2) Built around a similar general "pattern" of queries, with highly variable boolean filtering criteria and a consistent need to view the same data across a wide array of different dimensions
3) Bound by an SLA defined for the product experience, which determines a budget for how fast queries must run
Multi-tenant: the data sources that we deal with often have a field in them that describes the end-customer tenant. Often there are other fields that describe sub-teams inside of those tenants. There are often even multiple different ways to describe tenancy off of the same data. For example, in the ad tech world, there are publishers (the websites and mobile apps that show you ads), advertisers (the people who want to run the ad) and marketplaces (the people who run the auctions), and there are umpteen levels of intermediaries between each of these sub-divisions. When you want to understand what's going on in a marketplace, you want to be able to cut and aggregate the data across any and all of these different axes at the same time.
Query pattern: when you define a product based on data, that product is generally not going to suddenly wake up tomorrow and do a brand new query you've never seen or heard of before. You have to actually add code to your product to make it do that query. What will be brand new tomorrow is that the queries that have been pre-defined will get new filters added to them (new values will always be showing up, new tenants, new stuff). Additionally, one set of users of your product might be looking at usage by tenant, another might be looking at usage by team and yet another might be looking at usage by geo region, which means that the dimensions looked at will differ. This is what I mean when I say that the shape is the same, but the boolean logic and the dimensions looked at change.
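To make the "same shape, different filters and dimensions" idea concrete, here is a minimal sketch in plain Python. It is not Druid's API or query language. The event fields, the sample rows and the `run_query` helper are all hypothetical and exist only for illustration. The shape (filter, group, sum) is fixed in code, and only the filter values and the grouping dimensions change between calls.

```python
from collections import defaultdict

# Hypothetical ad-tech events; field names and values are illustrative assumptions.
EVENTS = [
    {"publisher": "news.example", "advertiser": "acme", "geo": "US", "impressions": 100, "clicks": 4},
    {"publisher": "news.example", "advertiser": "globex", "geo": "DE", "impressions": 50, "clicks": 1},
    {"publisher": "games.example", "advertiser": "acme", "geo": "US", "impressions": 200, "clicks": 10},
    {"publisher": "games.example", "advertiser": "acme", "geo": "DE", "impressions": 70, "clicks": 2},
]


def run_query(events, dimensions, filters):
    """Fixed query shape: filter, group, sum.

    Only `dimensions` (fields to group by) and `filters` (field -> set of
    allowed values, combined with AND) vary from one call to the next.
    """
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for event in events:
        # Keep the event only if every filtered field has an allowed value.
        if all(event[field] in allowed for field, allowed in filters.items()):
            key = tuple(event[d] for d in dimensions)
            totals[key]["impressions"] += event["impressions"]
            totals[key]["clicks"] += event["clicks"]
    return dict(totals)


# Same shape, two different views of the same data.
by_publisher = run_query(EVENTS, ["publisher"], {"advertiser": {"acme"}})
by_geo = run_query(EVENTS, ["geo"], {"publisher": {"news.example", "games.example"}})
```

A new tenant or a new filter value tomorrow only changes the arguments, not the code path, which is the property the product relies on.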
SLA: having an SLA is also important for any product experience. When you are painting a UI for someone, they just want the UI painted within their expectations and don't care about what it means technically to make that happen.
All 3 of these things contrast meaningfully with, e.g., Data Warehouse style workloads. With a Data Warehouse, the primary users are the business analysts. Those business analysts tend to be looking for a SQL interface. It is their job to write SQL statements, to join those tables and dig ever deeper into the ever-deepening data warehouse or lake or whatever other metaphor people want to have. On top of Data Warehouses, there are various visualization layers that allow you to define dashboards to try to target a wider audience, but at the end of the day those run against a data warehouse, and the data warehouse is fundamentally built with the persona of the data analyst in mind. This persona is generally going to come up with an ever more complex query to answer whatever the newest flavor-of-the-month question happens to be. As the person writes a more complex query, they are okay with it taking longer (of course, faster is always better, but for something that will be run once and never again, who cares if it takes a bit longer) and care much more that it is guaranteed to finish. I.e. they do not actually have a query SLA: highly variable query timings are totally fine and acceptable. Instead, they require that any query they write be executable.