Hacker News new | ask | show | jobs
by azimafroozeh 326 days ago
Arrow is primarily an in-memory format, while Parquet is commonly used for on-disk storage. Typically, data is stored in Parquet and then read into memory as Arrow. FastLanes is a new on-disk file format, comparable to Parquet, but offers around 40% better compression and faster decoding thanks to its data-parallel encoding design.

Disclaimer: I'm the first author of the paper.

1 comments

What about Feather? This is on my to do list, but I thought that Feather was a file format based on Arrow: https://docs.pola.rs/api/python/stable/reference/api/polars....

This is referenced in the link above. https://arrow.apache.org/docs/python/ipc.html

Unfortunately I'm stuck with CSV at work for now.

Feather appears to just be block compressed Arrow IPC [1]. Lightweight compression techniques generally achieve two orders of magnitude faster random access compared to block compression. That’s one of the benefits of formats like FastLanes, Vortex, DuckDB native, etc. DuckDB has a good blog post about it here: https://duckdb.org/2022/10/28/lightweight-compression.html

[1]: https://arrow.apache.org/docs/python/feather.html