by sradman
1979 days ago
O_DIRECT prevents double buffering of file data in both the OS page cache and the DBMS page cache. mmap removes the need for a DBMS page cache and relies on the OS's paging algorithm instead. The gain is zero memory copy and the ability for multiple processes to access the same data efficiently. Apache Arrow takes advantage of mmap to share data across processes written in different languages, and it enables fast startup for short-lived processes that re-access the same OS-cached data.
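To make the mmap side concrete, here is a minimal sketch in Python. The function name `sum_bytes_mmap` is made up for illustration; it maps a file read-only and walks the mapped pages through a `memoryview`, so the file contents are never copied into a separate `bytes` object in the process.

```python
import mmap


def sum_bytes_mmap(path: str) -> int:
    """Sum a file's bytes through a read-only mapping backed by the OS page cache."""
    with open(path, "rb") as f:
        # length=0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # The memoryview exposes the mapped pages directly, without
            # first copying the file into a bytes object.
            with memoryview(mapped) as view:
                return sum(view)
```

A second process that maps the same file would be served from the same cached pages, which is the sharing property described above.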
Arrow is a different use case, for which mmap makes sense. A short-lived process that stores config or caches in SQLite is probably closer to Arrow than to (e.g.) Postgres, so mmap likely makes sense there too. (Conversely, if you're not relying on Arrow's sharing properties and you have a big Python notebook that's doing some math on an extremely large data file on disk in a single process, you might actually get better results from O_DIRECT than mmap.)
In particular, "zero memory copy" only applies if you are accessing the same data from multiple processes (either at once or sequentially). If you have a single long-running database server, you have to copy the data from disk to RAM anyway. O_DIRECT means there's one copy, from disk to a userspace buffer; mmap means there's one copy, from disk to a kernel buffer. If you can arrange for a long-lived userspace buffer, there's no performance advantage to using the kernel buffer.
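For comparison, a rough sketch of the O_DIRECT side, again in Python. `read_direct` is a hypothetical helper, not anything from a real DBMS. It assumes the caller allocates one page-aligned buffer up front (an anonymous `mmap` is page-aligned) and reuses it, and that the buffer length is a multiple of the device's block size. Some platforms and filesystems do not support O_DIRECT, so the sketch falls back to an ordinary buffered read in that case.

```python
import mmap
import os


def read_direct(path: str, buf: mmap.mmap) -> int:
    """Read from the start of `path` into `buf`; return the number of bytes read.

    Tries O_DIRECT first so the data goes from disk to the userspace buffer
    without passing through the OS page cache. Falls back to a normal read
    if the platform or filesystem rejects O_DIRECT.
    """
    direct = getattr(os, "O_DIRECT", 0)  # 0 where the platform lacks the flag
    for flags in (os.O_RDONLY | direct, os.O_RDONLY):
        try:
            fd = os.open(path, flags)
            # buffering=0 gives a raw file object, so readinto() issues a
            # single read straight into the caller's buffer.
            with os.fdopen(fd, "rb", buffering=0) as f:
                return f.readinto(buf)
        except OSError:
            continue
    raise OSError(f"could not read {path}")
```

A long-running server would create the buffer once, for example `buf = mmap.mmap(-1, 16 * mmap.PAGESIZE)`, and keep calling `read_direct` with it. That is the "long-lived userspace buffer" case, where the kernel's copy adds nothing.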