by sradman 1979 days ago
O_DIRECT prevents double buffering of file data in both the OS page cache and the DBMS page cache. mmap removes the need for the DBMS page cache and relies on the OS’s paging algorithm instead. The gain from mmap is zero-copy access and the ability for multiple processes to access the same data efficiently.

Apache Arrow takes advantage of mmap to share data between processes written in different languages, and it enables fast startup for short-lived processes that re-access the same OS-cached data.
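To make the sharing point concrete, here is a rough sketch using only Python's standard `mmap` module. It is not how Arrow itself does it; the file name, the payload, and the `map_readonly` helper are made up for illustration. Two independent read-only mappings of the same file are served from the same OS page cache, and a `memoryview` slice over a mapping does not copy the bytes.

```python
import mmap
import os
import tempfile


def map_readonly(path):
    """Map a whole file read-only (hypothetical helper for this sketch)."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# Illustrative file: an 8-byte header followed by 16 data bytes.
path = os.path.join(tempfile.mkdtemp(), "shared.bin")
with open(path, "wb") as f:
    f.write(b"header--" + bytes(range(16)))

# Two mappings stand in for two processes; both are backed by the
# same page-cache pages rather than by private copies of the file.
reader_a = map_readonly(path)
reader_b = map_readonly(path)

# Slicing a memoryview of the mapping yields a window onto the mapped
# pages; no bytes are copied until something like bytes(view) is called.
view = memoryview(reader_a)[8:24]
```

In a real multi-process setup each process would call something like `map_readonly` on the same path; the second process to start finds the pages already cached, which is the fast-startup effect described above.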


Yes, but the claim is that the buffer you should remove is the OS's one, not the DBMS's one, because for the DBMS use case (one very large file with deep internal structure, generally accessed by one long-running process), the DBMS has information the OS doesn't.

Arrow is a different use case, for which mmap makes sense. Something like a short-lived process that stores config or caches in SQLite is probably closer to Arrow than to (e.g.) Postgres, so mmap likely makes sense there too. (Conversely, if you're not relying on Arrow's sharing properties and you have a big Python notebook that's doing some math on an extremely large data file on disk in a single process, you might actually get better results from O_DIRECT than mmap.)

In particular, "zero memory copy" only applies if you are accessing the same data from multiple processes (either at once or sequentially). If you have a single long-running database server, you have to copy the data from disk to RAM anyway. O_DIRECT means there's one copy, from disk to a userspace buffer; mmap means there's one copy, from disk to a kernel buffer. If you can arrange for a long-lived userspace buffer, there's no performance advantage to using the kernel buffer.
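Here is a minimal sketch of the two read paths, assuming Linux and a filesystem that accepts O_DIRECT (tmpfs, for one, does not, so the code falls back to a normal buffered read there). The 4096-byte `BLOCK` size and both helper names are assumptions for illustration; real code should query the device's logical block size.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment and transfer size for this sketch


def _read_into(path, flags, buf):
    """Open path with the given flags and read one block into buf."""
    fd = os.open(path, flags)
    try:
        return os.readv(fd, [buf])
    finally:
        os.close(fd)


def read_block_direct(path):
    """Disk -> userspace buffer, bypassing the OS page cache if possible."""
    # An anonymous mapping is page-aligned, which O_DIRECT requires.
    buf = mmap.mmap(-1, BLOCK)
    direct = getattr(os, "O_DIRECT", 0)  # Linux-only flag; 0 elsewhere
    try:
        n = _read_into(path, os.O_RDONLY | direct, buf)
    except OSError:
        # Filesystem rejected O_DIRECT: fall back to a buffered read.
        n = _read_into(path, os.O_RDONLY, buf)
    data = buf[:n]  # copied out only so the demo can compare results
    buf.close()
    return data


def read_block_mmap(path):
    """Disk -> kernel page cache; userspace sees those pages via the mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:BLOCK]  # copied out only so the demo can compare results


path = os.path.join(tempfile.mkdtemp(), "block.bin")
payload = bytes(i % 251 for i in range(BLOCK))
with open(path, "wb") as f:
    f.write(payload)

direct_data = read_block_direct(path)
mapped_data = read_block_mmap(path)
```

Both paths move the block from disk into RAM exactly once. The difference is who owns the resulting buffer: with O_DIRECT the long-running process does and can apply its own eviction policy, while with mmap the kernel does.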

> but the claim is that the buffer you should remove is the OS's one

I was not trying to minimize O_DIRECT; I was trying to emphasize the key advantage succinctly and also explain the Apache Arrow use case of mmap, which the article does not discuss.