|
|
|
|
|
by chrisjc
1695 days ago
|
|
Did a little searching, but didn't find much about DuckDB and Arrow. > DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. Does DuckDB use Arrow as its in-memory format? If so, that's pretty awesome. I'm hearing about DuckDB for the first time today and it's already leaving an impression on me. |
|
I don't think it's the case, but they serve the same purpose anyway: inject the CSV or parquet files you got from the data wharehouse into something you can use for analytics.
Memory-based storage is super fast and fine until your datasets reach a terabyte. So you have a few solutions left:
1. You fire a fresh postgres and you store your data there. It's going to take quite a bit of time to inject those 600M lines into the instance.
2. You have some kind of tool (DataFusion) that allow you to run SQL queries on those parquet files without going through a painful copy/insert process. Performances matter a bit, but not as much because you're already avoiding a full dataset copy and a large-instance expense in the process. Even a 40% perf hit vs Postgres is absolutely acceptable, because you're already winning on both sides (total execution time, financial expenses).