by slt2021
960 days ago
Question about Arrow: the format doesn't seem very space efficient. I tried converting one of the Parquet files in my data lake to Arrow, and the size difference is staggering:
20 MB Parquet -> 700 MB Arrow. Doesn't seem fit for a data lake at all.
> Parquet and Arrow are complementary technologies, and they make some different design tradeoffs. In particular, Parquet is a storage format designed for maximum space efficiency, whereas Arrow is an in-memory format intended for operation by vectorized computational kernels.
> The major distinction is that Arrow provides O(1) random access lookups to any array index, whilst Parquet does not. In particular, Parquet uses Dremel record shredding, variable length encoding schemes, and block compression to drastically reduce the data size, but these techniques come at the loss of performant random access lookups.
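
To make the tradeoff in that quote concrete, here is a rough stdlib-only sketch. It is not how Arrow or Parquet are actually implemented: the fixed-width `array` stands in for an Arrow-style buffer, and a toy run-length encoding plus `zlib` stands in for Parquet-style encoding and block compression. The column contents and the helper names (`rle_encode`, `rle_lookup`) are made up for illustration.

```python
from array import array
import zlib

# Hypothetical column: 100,000 integers with long runs of repeated values.
values = [i // 1_000 for i in range(100_000)]

# "Arrow-like" stand-in: fixed-width contiguous buffer.
# Element i sits at byte offset i * itemsize, so indexing is O(1).
fixed = array("q", values)
fixed_bytes = len(fixed) * fixed.itemsize


def rle_encode(vals):
    """Toy run-length encoding: a list of [value, run_length] pairs."""
    runs = []
    for v in vals:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_lookup(runs, index):
    """Random access into the encoded form has to walk the runs."""
    seen = 0
    for value, length in runs:
        if index < seen + length:
            return value
        seen += length
    raise IndexError(index)


# "Parquet-like" stand-in: variable-length encoding, then block compression.
runs = rle_encode(values)
flat_runs = array("q", [x for run in runs for x in run])
compressed = zlib.compress(flat_runs.tobytes())

print("fixed-width bytes:", fixed_bytes)
print("encoded + compressed bytes:", len(compressed))
print("same answer:", fixed[12_345] == rle_lookup(runs, 12_345))
```

The encoded form is far smaller on repetitive data like this, but every lookup scans from the start of the runs (and real block compression would also need a decompress step first), while the fixed-width buffer answers any index directly. How big the gap is depends entirely on the data.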