| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by talolard 1218 days ago

I use Duckdb as a data scientist / analyst. It’s amazing for working with large data locally, because it is very fast and there is almost 0 overhead for use.

For example, I helped an Israeli ngo analyze retailer pricing data (supermarkets must publish prices every day by law). Pandas chokes on data that large, Postgres can handle it but aggregations are very slow. Duckdb is lightning fast.

The traditional alternative I’m familiar with is spark, but it’s such a hassle to setup, expensive to run and not as fast on these kinds of use cases.

I will note that familiarity with Parquet and how columnar engines work is helpful. I have gotten tremendous performance increases when storing the data in a sorted manner in a parquet file, which is ETL overhead.

Still, it’s a very powerful and convenient tool for working with large datasets locally

2 comments

stinos 1218 days ago

So I'm not super familiar with different databases, but do understand the basics and do know how to work wit data with e.g. pandas, and do think I understand what Duckdb is useful for, but what I'm still completely missing is: how do I get data in Duckdb? I.e. how did you get that data into Duckdb? Or: suppose I have a device producing sensor data, normally I'd connect to some MySQL endpoint somehow and tell it to insert data. How does one do that with Duckdb? Or is the idea rather that you construct your Duckdb first by getting data from somewhere else (like the MySQL db in my example)?

link

RyEgswuCsn 1216 days ago

My experience has been that most of the time you don’t tell DuckDB to insert data. One is expected to point DuckDB to an existing data file (parquet, csv, json with this new release, etc.) and either import/copy the data into DuckDB tables, or issue SQL queries directly against the file.

Think of it as a SQL engine for ad-hoc querying larger-than-memory datasets.

link

talolard 1217 days ago

You can do both ways but the latter is the more useful one. Duckdb is designed to read the data very fast and to operate on it fast. So you load a csv/json/parquet and then “create table” and Duckdb lays out the data in a way that makes it fast to read.

But you(I) wouldn’t use it like a standard db where stuff gets constantly written in, rather like a tool to effectively analyze data that’s already somewhere

link

talolard 1218 days ago

Oh just took a look at the release notes, the new ability to write hive partitioned data with the partition by clause makes etl stuff much easier

link