Hacker News
by zeitlupe 1246 days ago
I do not fully get the speed argument. If the dataset is small-ish, speed does not significantly matter (provided you only use the vectorized functions, of course). And if it's big data, I do not use pandas but Spark or a cloud data warehouse. For this reason I also do not get the use case for DuckDB.

Beyond this, a dataframe api/syntax, compared to SQL syntax, is to me way easier to follow and to debug.

3 comments

The use case for DuckDB is that the vast majority of analytics workloads aren't on "big data", but also aren't "small-ish". If your data fits on a single machine, DuckDB will allow you to query it at great speed. I disagree that the speed "doesn't significantly matter".

There is work being done on a dataframe API frontend for DuckDB, if you prefer that interface.

> I do not fully get the speed argument

The way I see the speed improvements is this: fast processes run faster, moderate-to-long processes (that are still awkward and large in Python, but not enough to justify the shift to Spark) now run noticeably faster, and the threshold for “what I need to use Spark for” shifts a long way up the scale. That is, you can now do more with the same amount of compute.