by RobinL 1261 days ago
Author here. Since I wrote this, Arrow seems to be more and more pervasive. As a data engineer, the adoption of Arrow (and parquet) as a data exchange format has so much value. It's amazing how much time my colleagues and I have spent on data type issues that have arisen from the wide range of data tooling (R, Pandas, Excel etc. etc.). So much so that I try to stick to parquet, using SQL where possible to easily preserve data types (pandas is a particularly bad offender for managing data types).

In doing so, I'm implicitly using Arrow - e.g. with Duckdb, AWS Athena and so on. The list of tools using Arrow is long! https://arrow.apache.org/powered_by/

Another interesting development since I wrote this is DuckDB.

DuckDB offers a compute engine with great performance against parquet files and other formats. Probably similar performance to Arrow. It's interesting they opted to write their own compute engine rather than use Arrow's - but I believe this is partly because Arrow was immature when they were starting out. I mention it because, as far as I know, there's not yet an easy SQL interface to Arrow from Python.

Nonetheless, DuckDB still uses Arrow for some of its other features: https://duckdb.org/2021/12/03/duck-arrow.html

Arrow also has a SQL query engine: https://arrow.apache.org/blog/2019/02/04/datafusion-donation...

I might be wrong about this - but in my experience, it feels like there's more consensus around the Arrow format, as opposed to the compute side.

Going forward, I see parquet continuing on its path to becoming a de facto standard for storing and sharing bulk data. I'm particularly excited about new tools that allow you to process it in the browser. I've written more about this just yesterday: https://www.robinlinacre.com/parquet_api/, discussion: https://news.ycombinator.com/item?id=34310695.

2 comments

Thanks for sharing your insights. Any comments on Feather vs Parquet? If we don't need to support tools that can only interact with Parquet, how will Feather pan out as a Parquet alternative (or Feather can't be such alternative at all)?
I recently looked into this as well. Specifically how the two formats differ. As it stands right now the “Feather” file format seems to be a synonym for the Arrow IPC file format or “Arrow files” [0]. There should be basically no overhead while reading into the arrow memory format [1]. Parquet files on the other hand are stored in a different format and therefore incur some overhead while reading into memory but offer more advanced mechanisms for on-disk encoding and compression [1].

As far as I can tell the main trade-off seems to be around deserialization overhead vs on disk file size. If anyone has more information or experience with the topic I'd love to hear!

[0] https://arrow.apache.org/faq/#what-about-the-feather-file-fo... [1] https://arrow.apache.org/faq/#what-is-the-difference-between...

EDIT:

More information: https://news.ycombinator.com/item?id=34324649

This is also my understanding - see https://news.ycombinator.com/item?id=34324649
Thanks! Just stumbled across your comment as well.
Since you know a bunch about this, I'm going to ask you a question that I was about to research: If I have a dataset in memory in Arrow, but I want to cache it to disk to read back in later, what is the most efficient way to do that at this moment in time? Is it to write to parquet and read the parquet back into memory, or is there a more efficient way to write the native Arrow format such that it can be read back in directly? I think this sounds kind of like Flight, except that my understanding is that it is intended for moving the data across a network rather than temporally across a disk.
I'm not an expert in the nuts and bolts of Arrow, but I think you have two options:

- Save to feather format. Feather format is essentially the same thing as the Arrow in-memory format. This is uncompressed and so if you have super fast IO, it'll read back to memory faster, or at least, with minimal CPU usage.

- Save to compressed parquet format. Because you're often IO bound, not CPU bound, this may read back to memory faster, at the expense of the CPU usage of decompressing.

On a modern machine with a fast SSD, I'm not sure which would be faster. If you're saving to remote blob storage e.g. S3, parquet will almost certainly be faster.

See also https://news.ycombinator.com/item?id=34324649

Thanks! Exactly what I was looking for. I'll do some benchmarking of these two options for my workload.
You're probably looking for the Arrow IPC format [1], which writes the data in close to the same format as the memory layout. On some platforms, reading this back is just an mmap and can be done with zero copying. Parquet, on the other hand, is a somewhat more complex format and there will be some amount of encoding and decoding on read/write. Flight is an RPC framework that essentially sends Arrow data around in IPC format.

[1] https://arrow.apache.org/docs/python/ipc.html