
by wenc 1200 days ago

To me it does fill a big gap.

It’s a local columnar engine that I can use inside a Jupyter notebook. This lowers my cost of iteration tremendously.

Yes, I can query data from Postgres and munge it with Pandas.

But what if I need to iterate on a large set of Parquet files (mine is 200 GB on my local machine, Hive-partitioned, over a billion records) and munge them with complex SQL on a high-performance engine? And seamlessly join them with other, smaller local datasets (there are always smaller datasets that contain metadata) in CSV, Pandas and JSON format, all in the same SQL statement?

This is a surprisingly common use case in a lot of data science work, and prior to DuckDB you could not do it easily, ergonomically or quickly with a single tool. The authors of DuckDB talked to lots of data scientists to learn their pain points, and the final product shows that they really listened.