| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by zeitlupe 1246 days ago
	I do not fully get the speed argument. If the dataset is small'ish, it does not significantly matter (given you only use the vectorized functions, of course). And if it's big data, I do not use pandas but Spark/a cloud data warehouse solution. For this reason I also do not get the use case for duckdb. Beyond this, a dataframe api/syntax, compared to SQL syntax, is to me way easier to follow and to debug.

3 comments

orlp 1246 days ago

The use case for DuckDB is the that the vast majority of analytics aren't on "big data", but also aren't "small-ish". If your data fits on a single machine, DuckDB will allow you to query it at great speed. I disagree that the speed "doesn't significantly matter".

There is work being done on a dataframe API frontend for DuckDB, if you prefer that interface.

link

FridgeSeal 1246 days ago

> I do not fully get the speed argument

The way I see the speed improvements is this: fast processes run faster, moderate-to-long processes (that are still awkward and large in Python, but not enough to justify the shift to Spark) now run in noticeably faster time, and the threshold for “what I need to use Spark for, shifts a long way up the scale. That is, you can now do more, with the same amount of compute.

link

zeitlupe 1238 days ago

Some more thought in the same direction: https://dataengineeringcentral.substack.com/p/whats-all-the-...

link