Hacker News new | ask | show | jobs
by orthoxerox 1201 days ago
I think I write this under every article about DuckDB, but it's become an indispensable tool for me. I used to abuse Excel because going from Excel to a script to process some data was too much friction, but with DuckDB the friction is gone: loading CSV and Parquet (and now JSON) files is a snap, you can create and persist any tables you want, the SQL dialect has lots of useful sugar.
5 comments

I've just added support for duckdb to the free SQL tool I make: https://www.timestored.com/qstudio/help/duckdb-sql-editor It allows click to open, browsing tables etc. If you have some time, I would really appreciate feedback from a real user, other than myself.
Tried to open DuckDB file with one table with 890K rows and about 30 columns. Process stalled, had to kill it (MacOS). DuckDB cli opens it in a snap.
:( Thanks for trying. The only thing I can think is perhaps a version conflict (requires 0.70) but I would have expected it to say, rather than freeze. I will try to recreate the problem. Thanks.
Yes, I think I have 0.60, good point. I'll update the version and try again later.
The benefit of “in process” isn’t really clicking for me.

We had Pandas or similar if we wanted to load and transform some data in memory. SQL is nicer than Pandas APIs but not sure that’s a killer feature?

If we have a lot of data and multiple people working with it then it makes sense to centralise it in a database or warehouse where it’s then easy to access via SQL anyway.

We can query files on S3 with it and have the processing locally, but then we have network latency because compute and storage are further apart. There’s a cost benefit here because we don’t need to run a server which could be significant.

It’s nicely implemented but I’m not seeing a big gap that it fills?

Just to agree with @orthozerox, it's not a 'killer feature' but about lowering the friction.

Lots of users: (a) mentally align better with SQL than pandas APIs, regardless of whether they know both or not (b) want decent performance on their analyses, which they aren't getting from pandas, and won't get from many OLTP-databases they're using over-the-wire (c) want ease of accessing parquet and csv locally and remotely with minimal development overhead. it's super simple with duckdb.

Nonetheless, some of the other things you pointed out are some tradeoffs. We're building a serverless cloud capability at motherduck on top of duckdb in order to address some of these and optimize compute and storage based on data locality, bandwidth and the need for collaboration amongst multiple people.

Focusing on "in process" for the moment, I believe that where duckdb can really shine is in what I like to call last-mile-analytics.

Sure, the complete and most up to date version of your data can be stored in a warehouse/database but when you're potentially slicing/dicing/filtering/sorting/exploring/munging data, it can get quite expensive to have your warehouse/database servicing these requests. Even if it's a cloud/modern warehouse like Snowflake. This would be especially true if you're doing analytics on a non-OLAP database.

With duckdb you could pull down the subset of data you're working on from your warehouse/db/datalake, and then perform the last-minute-analytics "in-process". That might be in your notepad, browser (WASM), etc... As a result you can expect some pretty amazing query performance since it's all happening locally, on a subset of the entire data that you've selected.

Then of course if you still wanted to use pandas you can point it at duckdb and allow pandas to fill in the deficiencies of SQL (while still potentially pushing-down some SQL to duckdb). Then of course you can take those dataframes and push them right back into duckdb instead of writing back out disk.

> We can query files on S3 with it and have the processing locally, but then we have network latency because compute and storage are further apart.

If you're files on S3 in a sensible form (hive, iceberg, etc), then you can also use duckdb to pull only the data you need from your bucket and work on it locally.

Well, I know SQL much better than I know either Scala+Spark or Python+Pandas. DuckDB is also much easier to set up, being a self-contained JAR. I also like how having a DB file means I don't have to worry about losing in-memory dataframes or leaving random temp files all over the filesystem.

In short, it's not about having a single killer feature, but about being a low-friction solution to my problems.

To me it does fill a big gap.

It’s a local columnar engine that I can use inside a Jupyter notebook. This lowers my cost of iteration tremendously.

Yes I can query data from Postgres and munge with Pandas.

But what if I need to iterate on a large set of parquet files (mine is 200gb on my local machine, Hive partitioned, over a billion records) and munge them with complex SQL with a high perf engine? And seamlessly join with other smaller local datasets (there are always smaller datasets that contain metadata) in CSV, Pandas and JSON format in the same SQL statement?

This is a surprisingly common use case in a lot of data science work and prior to DuckDB you could not do it easily, ergonomically or quickly with a single tool. The authors of DuckDB talked to lots of data scientists to learn their pain points and the final product shows that they really listened well.

To me this is a lot simpler than Pandas as it’s just SQL but it’s SQLite with a columnar engine etc etc
You can also do this with AlaSQL.js which apart from being able to run SQL against JSON and also works with CSV and XSLS, just include it using a script tag or import it as a node module.
Any chance you’ve tried clickhouse local? I was thinking it might be a good fit but haven’t used duckdb at all so I might be missing out on big differences.
No, I haven't, but it looks like it's a standalone console application, while DuckDB is in-process, like SQLite, and lives inside its JDBC driver. This means I can use it inside any compatible GUI and get stuff like schema browsing and IntelliSense out of the box. Since I have DBeaver open all day anyway, DuckDB is always a tab away.
Clickhouse local is very fast and good for many purposes but DuckDB is equally fast (or faster in some cases but this is a moving target) and supports a wider range of SQL that people like me often use.

DuckDB is great for data scientists and people who need to run complex analytic queries in their Jupyter notebooks and their Python prod code on local or S3 hosted data.

What’s the workflow with JDBC? Say you connect Tableau to it. How would you populate it with data?
My workflow is that I have a connection to c:\temp\scratchpad.db in DBeaver that I populate via

  create table some_data as select * from 'c:\temp\whatever.csv'
or

  create table other_data as select * from read_parquet('c:\temp\000000_0')

which I then can transform using SQL and export the result into CSV, Parquet or SQLite when needed.
I'm recent convert too. I used to work with SQLite for querying datasets but for my usecase DuckDB is much faster plus CLI is nicer to work with.