|
|
|
|
|
by sanderjd
1261 days ago
|
|
Since you know a bunch about this, I'm going to ask you a question that I was about to research: If I have a dataset in memory in Arrow, but I want to cache it to disk to read back in later, what is the most efficient way to do that at this moment in time? Is it to write to parquet and read the parquet back into memory, or is there a more efficient way to write the native Arrow format such that it can be read back in directly? I think this sounds kind of like Flight, except that my understanding is that is intended for moving the data across a network rather than temporally across a disk. |
|
- Save to feather format. Feather format is essentially the same thing as the Arrow in-memory format. This is uncompressed and so if you have super fast IO, it'll read back to memory faster, or at least, with minimal CPU usage.
- Save to compressed parquet format. Because you're often IO bound, not CPU bound, this may read back to memory faster, at the expense of the CPU usage of decompressing.
On a modern machine with a fast SSD, I'm not sure which would be faster. If you're saving to remote blob storage e.g. S3, parquet will almost certainly be faster.
See also https://news.ycombinator.com/item?id=34324649