Hacker News new | ask | show | jobs
by chrisjc 1695 days ago
Did a little searching, but didn't find much about DuckDB and Arrow.

> DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

Does DuckDB use Arrow as its in-memory format? If so, that's pretty awesome.

I'm hearing about DuckDB for the first time today and it's already leaving an impression on me.

2 comments

> Does DuckDB use Arrow as its in-memory format?

I don't think it's the case, but they serve the same purpose anyway: inject the CSV or parquet files you got from the data wharehouse into something you can use for analytics.

Memory-based storage is super fast and fine until your datasets reach a terabyte. So you have a few solutions left:

1. You fire a fresh postgres and you store your data there. It's going to take quite a bit of time to inject those 600M lines into the instance.

2. You have some kind of tool (DataFusion) that allow you to run SQL queries on those parquet files without going through a painful copy/insert process. Performances matter a bit, but not as much because you're already avoiding a full dataset copy and a large-instance expense in the process. Even a 40% perf hit vs Postgres is absolutely acceptable, because you're already winning on both sides (total execution time, financial expenses).

It's not the same. Data fusion comes with ballista, with the goal of replacing spark for many usages. It also supports JSON and Avro
Yes, it's not the same, but they serve the same purpose. It's, honestly, not important if Datafusion or DuckDB are using the arrow memory layout or not. What matters is their ability to run SQL queries (or Map-Reduce workloads) on CSV/Parquet files _WITHOUT COPYING THEM_.

If you start comparing them to solutions that copy datasets, you haven't understood what problem they are solving. For that problem, use postgresql or bigquery.

What have I not understood ? Please enlighten me.
DuckDB developer here, DuckDB does not use Arrow as its in-memory format directly but uses something relatively comparable and has interfaces for (quickly) converting data back and forth to Arrow. Expect a blog post on this soon!