Hacker News new | ask | show | jobs
by aynyc 1875 days ago
Can you persist Arrow-format data to disk? I see a lot of interests in it, but I can't figure out the use cases. For example, let's say I have ton (xx TB) of well-structured objects on S3. I want to run query via Spark/Presto on the data. I still need to deserialize the data from ORC/PARQUET into ARROW right? The advantage with ARROW here is if Spark/Presto can use this format to pass the data between worker nodes, the query would be faster because we don't need to deserialize/serialize when passing data between nodes? If yes, how do I utilize the format in Spark/Presto?
1 comments

You can: it is serializable and self-describing. However, unless "disk" is super fast and thus more likely memory, and your data is ephemeral, you probably shouldn't. Instead, we've been happier as parquet/orc: tunable compression, nicer multi-part / parallel readers, and a bit more stable.

There is feather for persistence, but you don't need it: just as how you can stream binary arrow buffers to processes, you can write raw arrow to disk. In theory it might give some teams in some setups parallel read/write speedups, but we've been exploring other paths there, e.g., 90+GB/s per node via GDS https://pavilion.io/nvidia . I'm not aware of feather efforts targeting that kind of perf but would be curious!

To utilize w/ spark.. it already does underneath ;-) an increasing flow is something like spark filter -> gpu compute+ai, where the transfer is spark cpu rdd -> arrow (spark-native) -> rapids/tensorflow

Edit: Arrow dev does seem more active than parquet/orc (and a lot of their dev is _by_ arrow devs!), so give it another couple of years, and I can see arrow being stable enough that you can persist data with less fear of having to reprocess older files and having most of the compression features you'd want!

Got ya. We are sticking with Par/Orc for now, we are running into the scenario where size of the data is going up, query SLA is going down. At some point, we will need to look at other technology to reduce cost without sacrificing performance.
Yep. I may have been unclear, they work well together: we'll do a gpu parquet reader that returns an arrow dataframe that our ETL pipeline then transforms into visual depictions of the correlations+relationships in people's datasets. Stuff on disk is nice stable formats, stuff across our API boundaries & compute frameworks is arrow.
Interesting design! How big is your data per scan?
it varies.. a lot of our users look at say 50kb files for quick small and targeted visual sessions , but when doing something like a log dump analysis, we are working on TB files and 1-2 GB per streaming part is good. CPU arrow people like to do say 10KB-1MB per record batch, but GPU land is a lot faster by thinking in terms of bandwidth, and so 500MB-10GB per contiguous part, depending on GPU memory and working set size. likewise, depends on how compressed it is, as you ultimately care how much it uncompresses into for the downstream memory pressure. hope that makes sense!
You run TB files against GPU? Hmm... that's something I've never thought off. Interesting, any idea where I can research into?
> unless "disk" is super fast and thus more likely memory, and your data is ephemeral, you probably shouldn't

Can you elaborate why Arrow is not a good format for storing to disk? If you’re using it for in-memory querying, why would you not want to also serialize it directly to disk instead of using some intermediary format?

Stability: The format is still evolving

Performance: Arrow does not do significant compression. Feather started adding it, but that adds even more change risk. Parquet/ORC/Arrow are all fairly similar, so until Arrow catches up and stablizes, I'd stick w/ Parquet/ORC. We do GPU stuff, and get in-GPU decompression already, so that's been a win/win.