| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by alexisread 1739 days ago

I think the best way of looking at this is arrow-format = feather-format

I think the format layouts are the same, the main difference between them being compression. Compressed data can in many cases be faster than uncompressed data to read/scan, (https://stackoverflow.com/questions/48083405/what-are-the-di...) so the cpu transform above is simply an uncompress step, which is notably simpler than what would go on with Sqlite to transform the data to a sqlite structure.

What we're looking at here in this test is a direct-data-access pattern (DDA), which is important as you can avoid ETL caching steps eg. Parquet->Postgres (which have an ingestion time) if you can access the data quick enough for your use case, and if the data is on say s3, you can have multiple (parallel) readers onto the same data rather than a connection pool for databases.

It also allows joining different larger-than-memory datasets efficiently, and avoids much of the infrastructure costs for something like Presto (Athena has per-query costs instead of infrastructure costs).

What I'd have liked to see, would be a Vaex/DuckDB benchmark, as the main differentiation between them appears to be SQL vs df/linq/dyplr/Rx semantics.