Hacker News
by sandGorgon 2149 days ago
Same question here: why Feather, and why is it named differently?

Also, are Parquet and Arrow the same? I presume df.to_parquet('df.parquet.gzip', compression='gzip') will not use Arrow, so I'd have to use a separate library to save to Parquet using Arrow? A bit confused.

1 comment

Graphistry uses Parquet as the more stable, and thus persistent, storage format when folks save data, and Arrow for ephemeral internal data where we're OK with (and somewhat enjoy) version changes, since those just mean code version upgrades. There are performance differences in practice: Parquet has more per-column compression modes built in, which makes it attractive for colder storage, while Arrow is the better fit for in-memory/RPC/streaming/etc.

Re: Feather: Arrow itself isn't necessarily a full file format. You can imagine memory buffers being all over the heap with giant gaps in between, because different columns were generated at different times. In practice, though, folks will indeed serialize consolidated buffers to disk (pa.Table -> write stream -> file). If we couldn't do that, RPC wouldn't work ;-) My understanding of Feather is that it standardizes ideas around this consolidation, but we are able to save to disk (within versions) without it. We found it more predictable to stick to ~Parquet for storage and Arrow buffer passing for streaming, but now that Feather networking APIs for accelerated bulk transfers may be stabilizing, there may be speed advantages to using them over manual buffer streaming (while still sticking with Parquet for persistent files).

Arrow<->Parquet conversion is super fast because of the co-design around similar concepts: both use record batches of dense binary column buffers, so implementations can pointer-copy, memory map, use bulk copy primitives, etc. for zero-copy or at least highly accelerated interop. Python RAPIDS GPU kernels can therefore selectively stream in a few Parquet columns across many Parquet files through a single 900GB/s GPU, compute over them, and write back out to Arrow or a new Parquet file.