by wesm 2149 days ago
Hi, Wes (Apache Arrow co-creator and Python pandas creator) here! If you're wondering what this project is all about, my JupyterCon keynote (18 min long) from 3 years ago is a good summary, and the vision / scope for what we've been doing since 2016 has been pretty consistent.

https://www.youtube.com/watch?v=wdmf1msbtVs

3 comments

Hi Wes!

I'm a big fan of pandas, and didn't know about Arrow. I've been considering doing a talk advocating for a consistent data-frame API across languages since, IMHO, it's the next fundamental data structure that should have baked-in support everywhere. So it appears you've at least somewhat beaten me to the punch.

Since Arrow is more than an API to tabular data structures, what would you think about a Promises/A+-like specification for dataframes?

How much of the Arrow API do you think end users will wind up using, as opposed to being a lower-level framework that projects like pandas and dplyr wind up using behind the scenes?

Finally, do you think that Arrow has the potential to be the logical successor to pandas? If not, what is your long term strategy to address the shortcomings that you see in pandas?

Hi Wes

I use pandas every day, thank you for that. Just watched this keynote and I really like the vision; I currently work with a bunch of guys who prefer dplyr.

Is Arrow just for in-memory analytics, or are there plans to support in-database analytics too?

Thanks Wes, pandas and Arrow are great projects. Is Feather now ready for long-term storage with the V2 release? And now that it's just a renaming of the Arrow IPC format, what's its future?
See

http://arrow.apache.org/faq/index.html#what-about-arrow-file...

You can store them long-term if you want (and you'll still be able to read them 5 years from now) but we aren't optimizing the Arrow IPC format for the _needs_ of long-term storage.

I’m also very confused about the relationship between Arrow’s stability guarantees and those of the on-disk Feather format. Can we safely switch from Parquet to Feather for long-term data storage?
Same question here: why Feather, and why is it named differently?

Also, are Parquet and Arrow the same? I presume df.to_parquet('df.parquet.gzip', compression='gzip') will not use Arrow? Do I have to use a separate library to save to Parquet using Arrow? A bit confused.

Graphistry uses parquet as a more stable and thus persistent storage format when folks save data, and arrow for ephemeral internal data where we're ok (and somewhat enjoy) version changes, as that just means code version upgrades. There are performance differences in practice such as parquet having more per-column compression modes built in, making it attractive for colder storage, and arrow for in-memory/rpc/streaming/etc for similar reasons.

RE: Feather -- Arrow itself isn't necessarily a full file format -- you can imagine memory buffers being all over the heap with giant gaps in between b/c diff cols were generated at diff times -- but in practice folks will indeed serialize consolidated buffers to disk (pa.Table -> write stream -> file). If we couldn't do that, RPC wouldn't work ;-) My understanding of Feather is that it standardizes ideas around this consolidation, but we are able to save to disk (within versions) without it. We found it more predictable to stick to ~Parquet for storage and Arrow buffer passing for streaming, but now that Feather networking APIs for accelerated bulk transfers may be stabilizing, there may be speed advantages to using it over manual buffer streaming (and still stick w/ Parquet for persistent files).

Arrow<>Parquet conversion is super fast b/c of the co-design around similar concepts: both using record batches of dense binary column buffers means implementations can pointer-copy, memory map, use bulk copy primitives, etc. for zero-copy or at least highly accelerated interop. Python RAPIDS GPU kernels can therefore selectively stream in a few parquet columns across many parquet files through a single 900GB/s GPU, compute over them, and write back out to arrow or a new parquet.