Hacker News new | ask | show | jobs
by lmeyerov 1875 days ago
You can: it is serializable and self-describing. However, unless "disk" is super fast and thus more likely memory, and your data is ephemeral, you probably shouldn't. Instead, we've been happier as parquet/orc: tunable compression, nicer multi-part / parallel readers, and a bit more stable.

There is feather for persistence, but you don't need it: just as how you can stream binary arrow buffers to processes, you can write raw arrow to disk. In theory it might give some teams in some setups parallel read/write speedups, but we've been exploring other paths there, e.g., 90+GB/s per node via GDS https://pavilion.io/nvidia . I'm not aware of feather efforts targeting that kind of perf but would be curious!

To utilize w/ spark.. it already does underneath ;-) an increasing flow is something like spark filter -> gpu compute+ai, where the transfer is spark cpu rdd -> arrow (spark-native) -> rapids/tensorflow

Edit: Arrow dev does seem more active than parquet/orc (and a lot of their dev is _by_ arrow devs!), so give it another couple of years, and I can see arrow being stable enough that you can persist data with less fear of having to reprocess older files and having most of the compression features you'd want!

2 comments

Got ya. We are sticking with Par/Orc for now, we are running into the scenario where size of the data is going up, query SLA is going down. At some point, we will need to look at other technology to reduce cost without sacrificing performance.
Yep. I may have been unclear, they work well together: we'll do a gpu parquet reader that returns an arrow dataframe that our ETL pipeline then transforms into visual depictions of the correlations+relationships in people's datasets. Stuff on disk is nice stable formats, stuff across our API boundaries & compute frameworks is arrow.
Interesting design! How big is your data per scan?
it varies.. a lot of our users look at say 50kb files for quick small and targeted visual sessions , but when doing something like a log dump analysis, we are working on TB files and 1-2 GB per streaming part is good. CPU arrow people like to do say 10KB-1MB per record batch, but GPU land is a lot faster by thinking in terms of bandwidth, and so 500MB-10GB per contiguous part, depending on GPU memory and working set size. likewise, depends on how compressed it is, as you ultimately care how much it uncompresses into for the downstream memory pressure. hope that makes sense!
You run TB files against GPU? Hmm... that's something I've never thought off. Interesting, any idea where I can research into?
rapids.ai
> unless "disk" is super fast and thus more likely memory, and your data is ephemeral, you probably shouldn't

Can you elaborate why Arrow is not a good format for storing to disk? If you’re using it for in-memory querying, why would you not want to also serialize it directly to disk instead of using some intermediary format?

Stability: The format is still evolving

Performance: Arrow does not do significant compression. Feather started adding it, but that adds even more change risk. Parquet/ORC/Arrow are all fairly similar, so until Arrow catches up and stablizes, I'd stick w/ Parquet/ORC. We do GPU stuff, and get in-GPU decompression already, so that's been a win/win.