Hacker News new | ask | show | jobs
by slt2021 960 days ago
the O(1) random access doesn't look like advantage at all, honestly.

if I load my parquet into memory - I will have O(1) random access to any row just as well.

plus, considering that Arrow recommends to work in chunks of 1000 rows per file, I am curious to learn exact tasks for which Arrow is optimizing for.

the only use case I can think of is transferring data between systems written in different languages/runtimes and doing zero serialization/deserialization, just send/receive memory buffers that are nicely mapped to dataframes.

1 comments

Into what format in memory? You can't just mmap into a parquet file and access whole records, it's a columnar format.

Arrow is absolutely designed for Interop between languages - often devs want to develop their core in something like Python, and the platform devs want to work on Scala. Arrow lets you write all the distributed system and data shuffling in Scala, but then users can access the same records in their user-defined code with minimal overhead