by seg_lol 2053 days ago
I am confused about Parquet with respect to the rest of your stack. Is it just that Parquet happens to be the Redshift export format? Or are you actually using Arrow and Parquet at the same time in some manner?

Parquet is the persistent storage format for when the data is written to disk or object storage. Arrow is the in-memory format and a set of tools built around working with it.

So InfluxDB IOx will use both: Arrow in memory for fast access, and Parquet on storage for persistence.

Ok, I didn't realize that Arrow was in-memory only and not also on-disk, which is why serializing between the two didn't make sense to me. I would have thought that Arrow would also be an on-disk format (mmap'd) so that there would be little to no conversion overhead.

Why convert to an on-disk format (Parquet) and not save the in-memory representation to storage directly?