| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by marcle 1738 days ago

TLDR: Arrow and DuckDB provide fast database aggregates compared with R's RDS format and, to an extent, SQLite.

It is unclear how much functionality is available for Arrow under R: any comments? It would also be interesting to see a similar benchmark for Python, which could include the embedded version of MonetDB -- an R package for MonetDB/e is not yet available.

Edit: amended the TLDR to reflect jhoechtl's and wodenokoto's comments. SQLite provided reasonably memory efficient aggregates.

2 comments

wodenokoto 1738 days ago

I came to a completely different conclusion.

RDS is slow to load, because it has to unzip and read everything into memory. All others are fast to load because they have somesort of index into data on disk (at the cost of being much larger at rest)

Everything else is fast to load compared to RDS, with arrow being the fastest because its index happened to be optimized for the test query.

jhoechtl 1738 days ago

> TLDR: Arrow and DuckDB provide fast and memory efficient database aggregates compared with R's RDS format and SQLite. Arrow and RDS were fast to load.

I nowhere read that the authors come to the conclusion that SQlite does NOT provide fast and memory efficient database aggregates. Instead they conclude

> SQLite and DuckDB files consists in a single large file (3 GB and 2.5 GB each), but the indexes we created allow their respective packages to read a copy of the tables that has just the year and reporter_iso columns, and therefore allows very fast filtering to provide the exact location of what we need to read in the large tables.