Hacker News new | ask | show | jobs
by Fiahil 1696 days ago
So DuckDB is a kind of DataFusion (https://arrow.apache.org/datafusion/) ?
3 comments

DuckDB developer here, DuckDB has a query engine that can directly query external data formats (stored in CSV, Parquet, Arrow, Pandas, etc) without loading the data directly, but also has its own columnar ACID-compliant storage format.

It can certainly be used in the same manner as DataFusion, but can also serve as a stand-alone database system. DuckDB aims to have much more comprehensive SQL support beyond only SELECT queries.

Thanks for the clarification ! :)

Not an advice, but you should probably consider spinning a secondary product from DuckDB with a sole focus on "reading data from parquet files and running aggregations the most efficiently possible". You can probably skip INSERT, UPDATE, DELETE completely.

There is currently a gap in practical solutions for this pain point. You can use Spark or Airflow, but nothing that comes without a big infra price tag (you can do that with pandas, but you need a large instance to load the entire dataset in memory). I think the right product could even outpace what you currently have with DuckDB.

I don't know what a DataFusion is, or what Arrow is, but DuckDB is, in my understanding (and use of it), an "almost" drop in replacement for SQLite, except with better performance.

Anytime you might want to use SQlite as an actual RDBMS (not as an easy disk fileformat), drop in DuckDB instead and get much better performance. DuckDB is actually designed to support OLAP.

> Anytime you might want to use SQlite as an actual RDBMS (not as an easy disk fileformat), drop in DuckDB instead and get much better performance. DuckDB is actually designed to support OLAP.

Aren't RDBMS good at both OLAP and OLTP while not being the best at each?

Wouldn't SQLite still be better at OLTP?
Did a little searching, but didn't find much about DuckDB and Arrow.

> DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.

Does DuckDB use Arrow as its in-memory format? If so, that's pretty awesome.

I'm hearing about DuckDB for the first time today and it's already leaving an impression on me.

> Does DuckDB use Arrow as its in-memory format?

I don't think it's the case, but they serve the same purpose anyway: inject the CSV or parquet files you got from the data wharehouse into something you can use for analytics.

Memory-based storage is super fast and fine until your datasets reach a terabyte. So you have a few solutions left:

1. You fire a fresh postgres and you store your data there. It's going to take quite a bit of time to inject those 600M lines into the instance.

2. You have some kind of tool (DataFusion) that allow you to run SQL queries on those parquet files without going through a painful copy/insert process. Performances matter a bit, but not as much because you're already avoiding a full dataset copy and a large-instance expense in the process. Even a 40% perf hit vs Postgres is absolutely acceptable, because you're already winning on both sides (total execution time, financial expenses).

It's not the same. Data fusion comes with ballista, with the goal of replacing spark for many usages. It also supports JSON and Avro
Yes, it's not the same, but they serve the same purpose. It's, honestly, not important if Datafusion or DuckDB are using the arrow memory layout or not. What matters is their ability to run SQL queries (or Map-Reduce workloads) on CSV/Parquet files _WITHOUT COPYING THEM_.

If you start comparing them to solutions that copy datasets, you haven't understood what problem they are solving. For that problem, use postgresql or bigquery.

What have I not understood ? Please enlighten me.
DuckDB developer here, DuckDB does not use Arrow as its in-memory format directly but uses something relatively comparable and has interfaces for (quickly) converting data back and forth to Arrow. Expect a blog post on this soon!