Hacker News
by dmytroi 1981 days ago
Did some research on high-bandwidth/high-IOPS file access. Some of my conclusions could be wrong, but as far as I discovered, modern NVMe drives need some queue pressure on them to perform at advertised speeds: at the hardware level they are essentially a separate CPU running in the background with its own command queue(s). They also need requests aligned with the flash memory hierarchy to perform at advertised speeds. So that puts a quite finicky limitation on your access patterns: 64-256 KB aligned blocks, 8+ accesses in parallel. To see this, just try CrystalDiskMark with the queue depth at 1-2 and/or the block size set to something small, like 4 KB, and watch your random speed plummet.
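
For anyone who wants to poke at the queue depth / block size effect without CrystalDiskMark, here is a rough sketch in Python. The function name `random_read_benchmark` and its defaults are made up for illustration. It uses a thread pool around `os.pread` as a stand-in for queue depth, and it goes through the page cache, so on a file that is already cached it measures RAM rather than the drive.

```python
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor


def random_read_benchmark(path, block_size=128 * 1024, depth=8, requests=256, seed=0):
    """Issue `requests` aligned random reads with up to `depth` in flight.

    Returns (bytes_read, seconds). Hypothetical helper, not a real benchmark
    tool: reads are buffered, so cached data never touches the drive.
    """
    size = os.path.getsize(path)
    blocks = size // block_size
    if blocks == 0:
        raise ValueError("file is smaller than one block")
    rng = random.Random(seed)
    # Offsets are multiples of block_size, so every request is block-aligned.
    offsets = [rng.randrange(blocks) * block_size for _ in range(requests)]

    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        # os.pread releases the GIL while it blocks, so `depth` worker threads
        # keep roughly `depth` requests outstanding at once.
        with ThreadPoolExecutor(max_workers=depth) as pool:
            total = sum(pool.map(lambda off: len(os.pread(fd, block_size, off)), offsets))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return total, elapsed
```

Comparing `depth=1, block_size=4096` against `depth=8, block_size=128 * 1024` on a large, uncached file should show the gap described above, though the exact numbers depend on the drive and on how much of the file is cached.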

So given the limitations on the access pattern, if you just mmap your file and memcpy from the pointer, you'll get only ~1 access request in flight, if I understand right. And since the default page size is 4 KB, that will be a 4 KB request size. mmap also relies on IRQs for completion notifications (instead of polling the device state), which somewhat limits your IOPS. Prefetching will help, of course, but it relies on a lot of heuristic machinery to guess the correct access pattern, which sometimes fails.
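
A back-of-the-envelope way to see why one 4 KB request in flight is so limiting is Little's law: requests in flight = request rate x latency. The sketch below assumes a per-request latency of 100 microseconds. That figure is an assumption for illustration, not a measurement, and real latency grows with request size, so treat the results as order-of-magnitude only.

```python
def throughput_bytes_per_s(queue_depth, request_bytes, latency_s):
    """Little's law: request rate = requests in flight / latency."""
    iops = queue_depth / latency_s
    return iops * request_bytes


def required_queue_depth(target_bytes_per_s, request_bytes, latency_s):
    """Requests that must stay in flight to sustain the target throughput."""
    return target_bytes_per_s / request_bytes * latency_s


LATENCY_S = 100e-6  # ASSUMED per-request latency, not a measured value

# One 4 KiB page fault in flight at a time (the plain mmap + memcpy case).
qd1_4k = throughput_bytes_per_s(1, 4096, LATENCY_S)

# Queue depth needed to reach 7 GB/s with 4 KiB vs 128 KiB requests.
depth_4k = required_queue_depth(7e9, 4096, LATENCY_S)
depth_128k = required_queue_depth(7e9, 128 * 1024, LATENCY_S)
```

Under that assumed latency, a single 4 KiB request in flight tops out around 41 MB/s, and hitting 7 GB/s with 4 KiB requests would need roughly 171 requests in flight, versus about 5 with 128 KiB requests.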

As 7+ GB/s drives and 10+ GbE networks become more and more mainstream, the main place people will run into these requirements is file copying. For example, Windows Explorer struggles to copy files at 10-25+ GbE rates simply because of how its file access architecture is designed. Hopefully by then we will be better equipped to reason about "mmap" vs "read" (which really should be pread here, to avoid the offset sem in the kernel).
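
To make the pread point concrete, here is a minimal sketch of a file copy that keeps several large requests in flight and never touches the shared file position. The names `copy_chunk` and `parallel_copy`, the 1 MiB chunk size, and the 8 workers are illustrative choices, not anything a real copy tool is known to use.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def copy_chunk(src_fd, dst_fd, offset, length):
    """Copy one chunk using explicit offsets (no shared file position)."""
    done = 0
    while done < length:
        data = os.pread(src_fd, length - done, offset + done)
        if not data:  # unexpected EOF, e.g. the source was truncated
            break
        written = 0
        while written < len(data):  # pwrite may also write less than asked
            written += os.pwrite(dst_fd, data[written:], offset + done + written)
        done += len(data)
    return done


def parallel_copy(src, dst, chunk_size=1024 * 1024, workers=8):
    """Copy src to dst with up to `workers` chunks in flight; returns bytes copied."""
    src_fd = os.open(src, os.O_RDONLY)
    dst_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size = os.fstat(src_fd).st_size
        with ThreadPoolExecutor(max_workers=workers) as pool:
            copied = pool.map(
                lambda off: copy_chunk(src_fd, dst_fd, off, min(chunk_size, size - off)),
                range(0, size, chunk_size),
            )
            return sum(copied)
    finally:
        os.close(src_fd)
        os.close(dst_fd)
```

Because every call carries its own offset, the workers can share one descriptor per file without coordinating a seek position.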


Yep, mmap is really bad for performance on modern hardware because you can only fault on one page at a time (per thread), but SSDs require a high queue depth to deliver the advertised throughput. And you can't overcome that limitation by using more threads, because then you spend all your time on context switches. Hence, io_uring.
Can't you just use MAP_POPULATE which asks the system to populate the entire mapped address range, which is kind of like page-faulting on every page simultaneously?
That usually works if you have sufficient RAM, and do plan to touch substantially all of the file, and don't have any tight QoS targets to meet around the time you map the file.
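
For reference, a minimal sketch of what the MAP_POPULATE approach looks like from Python on Linux. `map_populated` is a made-up helper name, and `mmap.MAP_POPULATE` is only exposed on Linux builds of Python 3.10+, so the sketch falls back to a plain mapping elsewhere.

```python
import mmap
import os


def map_populated(path):
    """Map a whole file read-only, asking the kernel to prefault it if possible."""
    # MAP_POPULATE is Linux-only and exposed by Python 3.10+; fall back to 0.
    populate = getattr(mmap, "MAP_POPULATE", 0)
    fd = os.open(path, os.O_RDONLY)
    try:
        # length=0 maps the entire file. mmap duplicates the descriptor
        # internally, so closing ours afterwards is safe.
        return mmap.mmap(fd, 0, flags=mmap.MAP_SHARED | populate, prot=mmap.PROT_READ)
    finally:
        os.close(fd)
```

With the flag set, the `mmap` call itself can block while the file is read in, which is the QoS caveat mentioned above. Empty files raise `ValueError` because they cannot be mapped.
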
If you're reading sequentially this shouldn't be a problem because the VM system can pick up hints, or you can use madvise.
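
A minimal sketch of the madvise hint for a sequential scan, assuming Python 3.8+ on a platform that exposes `MADV_SEQUENTIAL`. `checksum_sequential` is a hypothetical helper, and the hint is advisory: the kernel is free to ignore it.

```python
import mmap
import os


def checksum_sequential(path, step=1 << 20):
    """Sum all bytes of a file through a mapping hinted as sequential."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, 0, prot=mmap.PROT_READ) as m:
            # mmap.madvise and MADV_SEQUENTIAL exist on Python 3.8+ where the
            # platform provides them; skip the hint elsewhere.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            total = 0
            # Walk the mapping front to back in `step`-sized slices.
            for start in range(0, len(m), step):
                total += sum(m[start:start + step])
            return total
    finally:
        os.close(fd)
```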

If you're reading randomly this is true and you want some kind of async I/O or multiple read operation.

mmap is also dangerous because there's no good way to return errors if the I/O fails, like if the file is resized or is on an external drive.

Even if you use madvise() for a large sequential read, the kernel will often restrict its behavior to something suboptimal with respect to performance on modern hardware.
If I read with a huge block size, say 100 MB, will the OS issue the requests in a sane way?
Yeah. Linux will end up splitting the requests down to typically 128kB blocks, but they're submitted to the SSD as a batch rather than one at a time, so there's sufficient work to keep the drive properly busy. But only do this if you actually need all 100MB. If you're randomly accessing only bits and pieces of the file, it's usually better to stick with 4kB requests (or larger if your file format and access patterns make that appropriate).
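
If you do go the one-big-read route, keep in mind that a single read call is allowed to return fewer bytes than requested, so it is worth wrapping it in a loop. A minimal sketch, with `pread_full` as a made-up helper name:

```python
import os


def pread_full(fd, length, offset):
    """Read up to `length` bytes at `offset`, retrying after short reads."""
    chunks = []
    done = 0
    while done < length:
        data = os.pread(fd, length - done, offset + done)
        if not data:  # end of file reached before `length` bytes
            break
        chunks.append(data)
        done += len(data)
    return b"".join(chunks)
```

Usage would look like `pread_full(fd, 100 * 1024 * 1024, 0)`. How the kernel splits and batches that request underneath is up to the block layer, as described above.
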
Typically reviews of drives publish rates at different queue depths, or at least specify the queue depths tested.