|
|
|
|
|
by thamer
2053 days ago
|
|
I'm using Apache Arrow to store CSV-like data and it works amazingly well. The datasets I work with contains a few billion records with 5-10 fields each, almost all 64-bit longs; I originally started with CSV and soon switched to a binary format with fixed-sized records (mmap'd) which gave great performance improvements, but the flexibility, size gains due to columnar compression and the much greater performance of Arrow for queries that span a single column or a small number of them won me over. For anyone who has to process even a few million records locally, I would highly recommend it. |
|
Right now I'm writing tools in Python (Python!) to analyse several 100TB datasets in S3. Each dataset is made up of 1000+ 6GB parquet files (tables UNLOADed from AWS Redshift db). Parquet's columnar compression gives a 15x reduction in on-disk size. Parquet also stores chunk metadata at the end of each file, allowing reads to skip over most data that isn't relevant.
And once in memory, the Arrow format gives zero-copy compatibility with Numpy and Pandas.
If you try this with Python, make sure you use the latest 2.0.0 version of PyArrow [1]. Two other interesting libraries for manipulating PyArrow Tables and ChunkedArrays are fletcher [2] and graphique[3].
[1] I use: conda install -c conda-forge pyarrow python-snappy
[2] https://github.com/xhochy/fletcher
[3] https://github.com/coady/graphique