by ayhanfuat
960 days ago
Arrow is not really designed for storage though. See the "Parquet vs Arrow" section of this post (https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encod...):

> Parquet and Arrow are complementary technologies, and they make some different design tradeoffs. In particular, Parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized computational kernels.

> The major distinction is that Arrow provides O(1) random access lookups to any array index, whilst Parquet does not. In particular, Parquet uses dremel record shredding, variable length encoding schemes, and block compression to drastically reduce the data size, but these techniques come at the loss of performant random access lookups.
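To make the quoted tradeoff concrete, here is a rough stdlib-only Python sketch. It does not use Arrow or Parquet, and the varint scheme is just a stand-in I picked for "variable length encoding" (Parquet's real encodings differ); `encode_varint` and `varint_lookup` are hypothetical helper names. A fixed-width buffer can be indexed directly, while the variable-length buffer is smaller but has to be decoded from the start to reach the i-th value.

```python
from array import array


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, high bit = 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def varint_lookup(buf: bytes, index: int) -> int:
    """Return the index-th value. Every earlier value must be decoded first,
    because value boundaries are not known in advance: O(index), not O(1)."""
    pos = 0
    value = 0
    for _ in range(index + 1):
        value, shift = 0, 0
        while True:
            byte = buf[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
    return value


values = [3, 300, 70000, 5, 1 << 40]

# Fixed-width layout: every value takes 8 bytes, so element i sits at offset 8 * i.
fixed = array("q", values)

# Variable-length layout: small values take fewer bytes, but offsets are irregular.
packed = b"".join(encode_varint(v) for v in values)

fixed_size = len(fixed) * fixed.itemsize
packed_size = len(packed)

print(fixed[3], varint_lookup(packed, 3))
print(fixed_size, "bytes fixed-width vs", packed_size, "bytes variable-length")
```

Block compression and record shredding widen the gap further in both directions: smaller on disk, more work to reach a single value.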
If I load my Parquet file into memory, I will have O(1) random access to any row just as well.
Plus, considering that Arrow recommends working in chunks of 1000 rows per file, I am curious to learn which exact tasks Arrow is optimizing for.
The only use case I can think of is transferring data between systems written in different languages/runtimes with zero serialization/deserialization: just send/receive memory buffers that map nicely to dataframes.
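A rough sketch of what "just send/receive memory buffers" means, using only the Python stdlib rather than Arrow itself. The `send` and `receive` names are hypothetical, and it assumes both sides agree on the layout up front (native-endian 64-bit signed integers), which is the kind of agreement a shared columnar format provides. The receiver reinterprets the bytes in place instead of parsing them value by value.

```python
from array import array


def send(values):
    """Producer side: emit the raw fixed-width column buffer as bytes."""
    return array("q", values).tobytes()


def receive(payload: bytes) -> memoryview:
    """Consumer side: view the same bytes as int64 without copying or parsing."""
    return memoryview(payload).cast("q")


payload = send([10, 20, 30, 40])
column = receive(payload)

# The view is backed by the received buffer itself; no per-element decode happened.
print(column[2], len(column), column.obj is payload)
```

Here `tobytes()` on the producer side does make one copy; the point is only that the consumer side needs no deserialization step.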