by RobinL 1261 days ago

Yes - save to parquet. From the OP:

"Why not just persist the data to disk in Arrow format, and thus have a single, cross-language data format that is the same on-disk and in-memory? One of the biggest reasons is that Parquet generally produces smaller data files, which is more desirable if you are IO-bound. This will especially be the case if you are loading data from cloud storage such as AWS S3.

Julien Le Dem explains this further in a blog post discussing the two formats:

>> The trade-offs for columnar data are different for in-memory. For data on disk, usually IO dominates latency, which can be addressed with aggressive compression, at the cost of CPU. In memory, access is much faster and we want to optimise for CPU throughput by paying attention to cache locality, pipelining, and SIMD instructions. https://www.kdnuggets.com/2017/02/apache-arrow-parquet-colum..."

1 comment

mempko 1261 days ago

I opted to store the data as Feather for one particular reason: you can open it with mmap and randomly index the data without having to load it all into memory. Also, the data I have isn't very compressible to begin with, so Parquet's CPU cost isn't worth the small savings in file size. This only makes sense in that narrow use case.

_frkl 1261 days ago

I'm doing the same. It's also quite nice for de-duplication: a lot of operations on our data happen on a per-column basis, and we need to assemble tables that are basically the same except for one or two computed columns. I usually store each column in a separate file and assemble tables on the fly, also memory-mapped. Quite happy with being able to do that. Not sure how easy that would be with Parquet.

Infernal 1260 days ago

As someone new to Arrow/columnar DBs, do you mind sharing what kind of data makes sense to use Arrow for, but isn't very compressible?
