| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jagtesh 1510 days ago

First of all, thanks for sharing this OP! So glad to see a way to query a df using SQL without further transformation.

Arrow has been truly revolutionary in this regard, providing a solid in-memory data format (with performant APIs in many languages) for interchange between different engines and even formats.

You can go from ORC to Parset to CSV on a local FS or S3.

With DuckDB, it’s like you can build your own AWS Athena at likely a fraction of the cost. Now if only someone would integrate vaex with DuckDB, it will make your powerful Apple Silicon machines a compelling alternative to running a full fledged Spark/Hadoop cluster.

1 comments

mywaifuismeta 1510 days ago

> With DuckDB, it’s like you can build your own AWS Athena at likely a fraction of the cost

Isn't the whole purpose of Athena to scale to large amounts of data that don't fit into memory? How does duckdb fit in here? I thought it's an in-memory database?

link

1egg0myegg0 1510 days ago

DuckDB is not only an in memory DB - it can persist data and handle larger than RAM workloads. We are expanding the number of operators that can handle larger than memory data, but it is a key design goal!

link

lmeyerov 1510 days ago

Yep, dask_sql is probably a better comparison here (and similarly arrow-friendly). We've been having pretty good experiences here, though clearly young. For in-process, we're sticking with pandas/cudf (GPU dataframes), and likewise still Arrow for IO interop, so easy hand-off.

link