
While DuckDB is an amazing project in its own right, I think the world that will open up around it is just as interesting, and these are exactly the kinds of questions that get me excited.

DuckDB is to Snowflake/BigQuery/Databricks/etc.

what

SQLite is to MySQL/Postgres/Oracle/etc. (let's ignore for the moment that Postgres and Oracle have HTAP modes)

In other words, I don't think DuckDB aims to replace or compete against the big OLAP products/services such as Snowflake, BigQuery, and Databricks. Instead, it's a natural and complementary component in the analytical stack.

Of course you'll see numerous blogs about how amazing it is for data exploration, wrangling, Jupyter, pandas, etc., but personally I find the questions about how it could be used in production use cases a lot more fascinating.

Data warehouses can become quite expensive to run and operate when you either

1) let front-end analytical applications connect to them directly to do analytics on the fly, or

2) pre-calculate ALL the analytics (whether they're used or not) and offload them to a cheaper and "faster" OLTP system.

I'm excited about how DuckDB can sort of bridge these two solutions.

1) Prepare semi-pre-calculated data on your traditional data warehouse (stored in an internal table, or in an external table format like Iceberg, Delta, etc.).

2) Ingest the subsets of this data needed for different production workloads into DuckDB for last-mile analytics and slicing/dicing.

To get at that data, DuckDB could either

1) push queries down to internal tables via its database scanners (Arrow across the wire; postgres_scanner, with hopefully more to come), or

2) prune external tables (Iceberg, Delta, etc.), interacting with their catalogs, to fetch subsets of the semi-pre-calculated analytical data on demand. Think intelligently partitioned Parquet files on S3.

Last-mile analytics, pagination, etc. can all be done within DuckDB, either directly in your browser (WASM) or on the edge with something like AWS Lambda. This could, and hopefully will, reduce the cost of keeping data warehouses around just to serve fully pre-calculated analytics to consumers, as well as reduce the complexity of your analytics stack/architecture.