Hacker News new | ask | show | jobs
by benjaminwootton 1201 days ago
The benefit of “in process” isn’t really clicking for me.

We had Pandas or similar if we wanted to load and transform some data in memory. SQL is nicer than Pandas APIs but not sure that’s a killer feature?

If we have a lot of data and multiple people working with it then it makes sense to centralise it in a database or warehouse where it’s then easy to access via SQL anyway.

We can query files on S3 with it and have the processing locally, but then we have network latency because compute and storage are further apart. There’s a cost benefit here because we don’t need to run a server which could be significant.

It’s nicely implemented but I’m not seeing a big gap that it fills?

5 comments

Just to agree with @orthozerox, it's not a 'killer feature' but about lowering the friction.

Lots of users: (a) mentally align better with SQL than pandas APIs, regardless of whether they know both or not (b) want decent performance on their analyses, which they aren't getting from pandas, and won't get from many OLTP-databases they're using over-the-wire (c) want ease of accessing parquet and csv locally and remotely with minimal development overhead. it's super simple with duckdb.

Nonetheless, some of the other things you pointed out are some tradeoffs. We're building a serverless cloud capability at motherduck on top of duckdb in order to address some of these and optimize compute and storage based on data locality, bandwidth and the need for collaboration amongst multiple people.

Focusing on "in process" for the moment, I believe that where duckdb can really shine is in what I like to call last-mile-analytics.

Sure, the complete and most up to date version of your data can be stored in a warehouse/database but when you're potentially slicing/dicing/filtering/sorting/exploring/munging data, it can get quite expensive to have your warehouse/database servicing these requests. Even if it's a cloud/modern warehouse like Snowflake. This would be especially true if you're doing analytics on a non-OLAP database.

With duckdb you could pull down the subset of data you're working on from your warehouse/db/datalake, and then perform the last-minute-analytics "in-process". That might be in your notepad, browser (WASM), etc... As a result you can expect some pretty amazing query performance since it's all happening locally, on a subset of the entire data that you've selected.

Then of course if you still wanted to use pandas you can point it at duckdb and allow pandas to fill in the deficiencies of SQL (while still potentially pushing-down some SQL to duckdb). Then of course you can take those dataframes and push them right back into duckdb instead of writing back out disk.

> We can query files on S3 with it and have the processing locally, but then we have network latency because compute and storage are further apart.

If you're files on S3 in a sensible form (hive, iceberg, etc), then you can also use duckdb to pull only the data you need from your bucket and work on it locally.

Well, I know SQL much better than I know either Scala+Spark or Python+Pandas. DuckDB is also much easier to set up, being a self-contained JAR. I also like how having a DB file means I don't have to worry about losing in-memory dataframes or leaving random temp files all over the filesystem.

In short, it's not about having a single killer feature, but about being a low-friction solution to my problems.

To me it does fill a big gap.

It’s a local columnar engine that I can use inside a Jupyter notebook. This lowers my cost of iteration tremendously.

Yes I can query data from Postgres and munge with Pandas.

But what if I need to iterate on a large set of parquet files (mine is 200gb on my local machine, Hive partitioned, over a billion records) and munge them with complex SQL with a high perf engine? And seamlessly join with other smaller local datasets (there are always smaller datasets that contain metadata) in CSV, Pandas and JSON format in the same SQL statement?

This is a surprisingly common use case in a lot of data science work and prior to DuckDB you could not do it easily, ergonomically or quickly with a single tool. The authors of DuckDB talked to lots of data scientists to learn their pain points and the final product shows that they really listened well.

To me this is a lot simpler than Pandas as it’s just SQL but it’s SQLite with a columnar engine etc etc