by talolard
1218 days ago
I use DuckDB as a data scientist / analyst. It's amazing for working with large data locally, because it is very fast and there is almost zero overhead to using it. For example, I helped an Israeli NGO analyze retailer pricing data (by law, supermarkets must publish their prices every day). Pandas chokes on data that large, and Postgres can handle it but is very slow at aggregations. DuckDB is lightning fast. The traditional alternative I'm familiar with is Spark, but it's such a hassle to set up, expensive to run, and not as fast on these kinds of use cases. I will note that familiarity with Parquet and with how columnar engines work is helpful. I have gotten tremendous performance increases from storing the data sorted in a Parquet file, though the sorting step adds ETL overhead. Still, it's a very powerful and convenient tool for working with large datasets locally.
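To illustrate why sorting can help so much: Parquet stores min/max statistics for each row group, and an engine can skip any row group whose range cannot match a filter. The sketch below is a toy Python model of that idea, not DuckDB's actual implementation. The function names, the chunk size, and the random "prices" are all made up for illustration.

```python
import random

def row_groups(values, size):
    """Split values into fixed-size chunks and record each chunk's min/max,
    loosely mimicking the per-row-group statistics stored in a Parquet file."""
    groups = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        groups.append((min(chunk), max(chunk), chunk))
    return groups

def groups_scanned(groups, lo, hi):
    """Count chunks whose [min, max] range overlaps the filter [lo, hi].
    Every other chunk could be skipped without being read."""
    return sum(1 for gmin, gmax, _ in groups if gmax >= lo and gmin <= hi)

random.seed(0)
# Hypothetical data: 100,000 random integer "prices".
prices = [random.randint(0, 999_999) for _ in range(100_000)]

unsorted_groups = row_groups(prices, 1_000)
sorted_groups = row_groups(sorted(prices), 1_000)

# Filter: 500_000 <= price <= 510_000
scanned_unsorted = groups_scanned(unsorted_groups, 500_000, 510_000)
scanned_sorted = groups_scanned(sorted_groups, 500_000, 510_000)

print("row groups scanned, unsorted:", scanned_unsorted)
print("row groups scanned, sorted:  ", scanned_sorted)
```

With unsorted data, nearly every chunk spans the whole value range, so almost nothing can be skipped. With sorted data, only the few chunks around the filter range need to be read. The cost is the extra sort before writing the file.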