|
|
|
|
|
by aynyc
1875 days ago
|
|
Can you persist Arrow-format data to disk? I see a lot of interests in it, but I can't figure out the use cases. For example, let's say I have ton (xx TB) of well-structured objects on S3. I want to run query via Spark/Presto on the data. I still need to deserialize the data from ORC/PARQUET into ARROW right? The advantage with ARROW here is if Spark/Presto can use this format to pass the data between worker nodes, the query would be faster because we don't need to deserialize/serialize when passing data between nodes? If yes, how do I utilize the format in Spark/Presto? |
|
There is feather for persistence, but you don't need it: just as how you can stream binary arrow buffers to processes, you can write raw arrow to disk. In theory it might give some teams in some setups parallel read/write speedups, but we've been exploring other paths there, e.g., 90+GB/s per node via GDS https://pavilion.io/nvidia . I'm not aware of feather efforts targeting that kind of perf but would be curious!
To utilize w/ spark.. it already does underneath ;-) an increasing flow is something like spark filter -> gpu compute+ai, where the transfer is spark cpu rdd -> arrow (spark-native) -> rapids/tensorflow
Edit: Arrow dev does seem more active than parquet/orc (and a lot of their dev is _by_ arrow devs!), so give it another couple of years, and I can see arrow being stable enough that you can persist data with less fear of having to reprocess older files and having most of the compression features you'd want!