| Arrow + Parquet is brilliant! Right now I'm writing tools in Python (Python!) to analyse several 100TB datasets in S3. Each dataset is made up of 1000+ 6GB parquet files (tables UNLOADed from AWS Redshift db). Parquet's columnar compression gives a 15x reduction in on-disk size. Parquet also stores chunk metadata at the end of each file, allowing reads to skip over most data that isn't relevant. And once in memory, the Arrow format gives zero-copy compatibility with Numpy and Pandas. If you try this with Python, make sure you use the latest 2.0.0 version of PyArrow [1]. Two other interesting libraries for manipulating PyArrow Tables and ChunkedArrays are fletcher [2] and graphique[3]. [1] I use: conda install -c conda-forge pyarrow python-snappy [2] https://github.com/xhochy/fletcher [3] https://github.com/coady/graphique |
You pay such a high overhead marshalling that data into an Arrow RecordBatch. Best thing ever is to work with the Parquet file and not even decompress the chunks that you don't need. Of course, this assumes that you're writing summary statistics as part of the metadata, which we plan to do.