by nicolast 2103 days ago
Not an Arrow expert at all, so I may be missing something, but I fail to understand the "The Science of Reading/Writing Data" section, or rather, its relevance to the article (and Arrow).

From what I could find, Arrow supports reading (writing?) data from (to?) memory-mapped files (i.e., memory regions created through mmap and friends). However, this says nothing about how the underlying IO is done, so it is unrelated to accessing IO devices through either IO ports or memory mappings (DMA and such).

This section seems to be mixing up two fairly distinct concepts: it talks about ways to access IO devices and transfer data to/from them (among which memory mapping is an option), whereas the memory mapping (mmap of files) used by Arrow is something a little different.

1 comment

The section is completely false. Memory-mapping files maps pages from the page cache, which lives in main memory. I suppose the author confused this with memory-mapped I/O and then confused port-based I/O with applications using syscalls. I see how you can arrive in that situation when you're only viewing the system through high-level abstractions in Java.
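To make the distinction concrete, here is a minimal sketch (plain Python standard library, not Arrow's own code) that reads the same file once with an ordinary read() and once through mmap. The helper name `read_both_ways` and the payload are made up for illustration. Both paths return identical bytes; mmap only changes how the process sees the file's pages, not how the kernel talks to the storage device.

```python
import mmap
import os
import tempfile


def read_both_ways(payload: bytes) -> tuple[bytes, bytes]:
    """Write payload to a temp file, then read it back via read() and via mmap."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)

        # Ordinary syscall-based read: the kernel copies the data into a new buffer.
        with open(path, "rb") as f:
            via_read = f.read()

        # File-backed mapping: the file's pages are mapped into the address space.
        # Length 0 means "map the whole file" (the file must not be empty).
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                via_mmap = mapped[:]  # slicing copies the mapped bytes out
    finally:
        os.remove(path)
    return via_read, via_mmap


# Hypothetical payload, purely for illustration.
via_read, via_mmap = read_both_ways(b"arrow" * 1000)
```

Neither path involves IO ports or device registers from the application's point of view; that part is the kernel's and the driver's business either way.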

> Thanks to Wes McKinney for this brilliant innovation, its not a surprise that such an idea came from him and team, as he is well known as the creator of Pandas in Python. He calls Arrow as the future of data transfer.

I assume the confusion is with the author of the blog post and not Wes McKinney, so this callout in that context is a real disservice.

> The output which displays the time, shows the power of this approach. The performance boost is tremendous, almost 50%.

Keep in mind that this is reading a 2 MB file with 100k entries, which somehow manages to consume half a second of CPU time. The author compares wall time and not CPU time; both runs consume somewhere between 600 ms and over a second of wall time (again, handling 2 MB of data). I wouldn't be surprised if the first call simply takes so long because it is lazily loading a bunch of code.
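For what it's worth, a fairer measurement would look roughly like the sketch below: do a warm-up call first so lazy imports and one-time setup are not charged to the first run, then record wall time and CPU time separately. This is only an illustration, not the article's benchmark; `measure` and `parse_rows` are made-up names, and `parse_rows` is a stand-in workload rather than an actual file reader.

```python
import time


def measure(fn, *, warmup=1, repeats=5):
    """Return (best wall time, best CPU time) in seconds for calling fn()."""
    # Warm-up runs absorb lazy imports, caches and other one-time costs.
    for _ in range(warmup):
        fn()

    wall, cpu = [], []
    for _ in range(repeats):
        w0, c0 = time.perf_counter(), time.process_time()
        fn()
        wall.append(time.perf_counter() - w0)   # elapsed wall-clock time
        cpu.append(time.process_time() - c0)    # CPU time of this process only
    return min(wall), min(cpu)


def parse_rows():
    """Stand-in workload (assumption): parse 100k small integers from text."""
    return sum(int(x) for x in ("1 " * 100_000).split())


best_wall, best_cpu = measure(parse_rows)
```

If wall time is much larger than CPU time, the process was mostly waiting rather than computing; if only the very first run is slow, that points at start-up cost and not at the file format.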

Later on memory consumption is measured, and one of the file format readers manages to consume -1 MB.

This article has a very bad smell.