Hacker News
by captrb 2052 days ago
"Parquet files work well, but streaming is a tad more complex (you need to be able to seek to the end of the file to read the metadata before you can stream the contents)"

I didn't realize that all the metadata in Parquet was stored at the end. That is indeed unfortunate for streaming use cases. It's especially sad because columnar, dictionary-encoded formats can offer great compression for some data. I've been achieving 20x+ size reductions by converting from CSV to Parquet.
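To make the "metadata at the end" point concrete, here is a minimal sketch of how a reader locates the footer. It assumes a plain (unencrypted) Parquet file, which ends with the file metadata, a 4-byte little-endian footer length, and the magic bytes PAR1. The function name read_footer_location is hypothetical, and the sketch only finds the footer; it does not decode the Thrift metadata inside it.

```python
import io
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of a plain Parquet file


def read_footer_location(f):
    """Return (footer_offset, footer_length) for a seekable file object.

    Hypothetical helper: it must jump to the end of the file first,
    which is exactly what a non-seekable stream cannot do.
    """
    size = f.seek(0, io.SEEK_END)
    if size < 12:  # leading magic + footer length + trailing magic
        raise ValueError("file too small to be Parquet")
    f.seek(size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    return size - 8 - footer_len, footer_len
```

Because the very first step is a seek to the end, a consumer reading from a pipe or socket has to buffer the whole file before it can interpret any column data.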

2 comments

Parquet is intended primarily as a file storage format. For streaming, I think the recommendation is to use Arrow, which is basically an in-memory counterpart to Parquet. It supports putting the schema first and then streaming an arbitrary number of rows.

https://arrow.apache.org/docs/python/ipc.html

Can you think of any solutions for this other than batching? I'm working on something similar right now, and I buffer several million records in memory, then write the file to external storage.
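For anyone following along, the batching approach described here can be sketched roughly as below. This is a generic illustration, not the commenter's actual code: write_in_batches and the flush callback are hypothetical names, and flush stands in for whatever writes one file (for example, one Parquet file) to external storage.

```python
def write_in_batches(records, batch_size, flush):
    """Buffer records in memory and hand each full buffer to flush().

    flush is a hypothetical callback that persists one batch.
    Returns the number of batches flushed.
    """
    buffer = []
    flushed = 0
    for record in records:
        buffer.append(record)
        if len(buffer) >= batch_size:
            flush(buffer)
            flushed += 1
            buffer = []  # start a fresh list so flushed batches are not mutated
    if buffer:  # flush the final, partially filled batch
        flush(buffer)
        flushed += 1
    return flushed
```

Memory use is bounded by batch_size, at the cost of producing many smaller files instead of one large one.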