For sequential accesses, it usually doesn't make much difference whether the drive's queue is full of lots of medium-sized requests (e.g. 128 kB) or a few giant requests (multiple MB), so long as the queue always has outstanding requests for enough data to keep the drive(s) busy. Every operating system will have its own preferred IO sizes for prefetching, and if you're lucky you can also tune the size of the prefetch window (either in terms of bytes, or in terms of number of IOs). Different drives will also have different requirements here to achieve maximum throughput; an enterprise drive that stripes data across 16 channels will probably need a bigger/deeper queue than a consumer drive with just 4 channels, if the NAND page size is the same for both.
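
One rough way to put a number on "enough data to keep the drive busy" is Little's law: bytes in flight ≈ throughput × per-request latency. The sketch below assumes that simple model. The throughput and latency figures are made up for illustration rather than measured on any particular drive, and `min_queue_depth` is a hypothetical helper name.

```python
import math


def min_queue_depth(target_bytes_per_s: float, latency_s: float, request_bytes: int) -> int:
    """Estimate how many requests must be outstanding to sustain a throughput.

    Uses Little's law: data in flight = throughput * latency.
    This is a lower bound under a simplified model, not a tuning recipe.
    """
    in_flight_bytes = target_bytes_per_s * latency_s
    return max(1, math.ceil(in_flight_bytes / request_bytes))


# Assumed (illustrative) drive: 7 GB/s sequential reads, 1 ms request latency.
# Both request sizes need the same ~7 MB in flight; only the request count differs.
print(min_queue_depth(7e9, 1e-3, 128 * 1024))       # 54 requests of 128 KiB
print(min_queue_depth(7e9, 1e-3, 2 * 1024 * 1024))  # 4 requests of 2 MiB
```

Under this model the request size mostly changes how many requests you need, not how much data has to be queued.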

However, optimal utilization of the drive(s) will always require a queue depth of more than one request, because you don't want the drive to be idle after signalling completion of its only queued command and waiting for the CPU to produce a new read request. In a RAID0 setup like the author describes, you need to also ensure that you're generating enough IO to keep both drives busy, and the minimum prefetch window size that can accomplish this will usually be at least one full stripe.
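
To see why a window of at least one full stripe matters for RAID0, here is a small sketch that maps a byte range onto member drives. It assumes a plain round-robin chunk layout; the 512 KiB chunk size and the helper names `full_stripe_bytes` and `drives_touched` are assumptions for illustration, not details of the author's array.

```python
from typing import Set


def full_stripe_bytes(chunk_bytes: int, n_drives: int) -> int:
    """Size of one full stripe: one chunk on every member drive."""
    return chunk_bytes * n_drives


def drives_touched(offset: int, length: int, chunk_bytes: int, n_drives: int) -> Set[int]:
    """Return the member-drive indices that the range [offset, offset + length) lands on.

    Assumes chunks are laid out round-robin: chunk k lives on drive k % n_drives.
    """
    first_chunk = offset // chunk_bytes
    last_chunk = (offset + length - 1) // chunk_bytes
    if last_chunk - first_chunk + 1 >= n_drives:
        return set(range(n_drives))  # the range spans every drive
    return {chunk % n_drives for chunk in range(first_chunk, last_chunk + 1)}


CHUNK = 512 * 1024  # assumed chunk size
DRIVES = 2

# A 128 KiB window sits entirely inside one chunk, so only one drive has work.
print(sorted(drives_touched(0, 128 * 1024, CHUNK, DRIVES)))  # [0]

# A window of one full stripe reaches both drives, whatever its alignment.
stripe = full_stripe_bytes(CHUNK, DRIVES)
print(sorted(drives_touched(300 * 1024, stripe, CHUNK, DRIVES)))  # [0, 1]
```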

As for how you accomplish the prefetching: the madvise system call sounds like a good choice, with the MADV_SEQUENTIAL or MADV_WILLNEED advice values. But how much prefetching that actually causes is up to the OS and the local system's settings. On my system, /sys/block/$DISK/queue/read_ahead_kb defaults to 128, which is definitely insufficient for at least some drives, though it might only apply to read-ahead triggered by the filesystem's heuristics rather than read-ahead explicitly requested through madvise. So manually touching pages from a userspace thread is probably the safer way to guarantee the OS pages in data ahead of time, as long as it doesn't run so far ahead of the actual use of the data that it creates memory pressure and gets not-yet-used pages evicted.
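
For the madvise route, a minimal sketch might look like the following. It covers only the madvise hints, not the page-touching thread, and it keeps the MADV_WILLNEED range a bounded distance ahead of the reader so the prefetch cannot run arbitrarily far ahead. It assumes Python 3.8+ on a platform that exposes `mmap.madvise`; the window and step sizes and the function names are illustrative assumptions, and how much read-ahead the kernel actually performs is still up to the OS.

```python
import mmap

# madvise and its constants are only present on some platforms (e.g. Linux).
HAVE_MADVISE = hasattr(mmap, "MADV_WILLNEED") and hasattr(mmap, "MADV_SEQUENTIAL")


def read_ahead_kb(disk):
    """Return the block layer's read_ahead_kb for a disk name, or None if unavailable."""
    try:
        with open(f"/sys/block/{disk}/queue/read_ahead_kb") as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def advise_window(mm, start, window_bytes):
    """Hint that [start, start + window_bytes) will be needed soon.

    Returns the end offset of the advised range.
    """
    start -= start % mmap.PAGESIZE  # madvise needs a page-aligned start
    if start >= len(mm):
        return len(mm)
    length = min(window_bytes, len(mm) - start)
    mm.madvise(mmap.MADV_WILLNEED, start, length)
    return start + length


def sequential_scan(path, window_bytes=4 * 1024 * 1024, step=1024 * 1024):
    """Read a non-empty file front to back, hinting at most window_bytes ahead.

    Returns the number of bytes consumed.
    """
    total = 0
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        if HAVE_MADVISE:
            mm.madvise(mmap.MADV_SEQUENTIAL)
        advised = 0
        for pos in range(0, len(mm), step):
            want = min(len(mm), pos + window_bytes)
            if HAVE_MADVISE and advised < want:
                # Extend the hinted range, but never past pos + window_bytes.
                advised = advise_window(mm, advised, pos + window_bytes - advised)
            chunk = mm[pos:pos + step]  # stand-in for the real work on the data
            total += len(chunk)
    return total
```

Calling `read_ahead_kb("nvme0n1")` (a hypothetical disk name) shows the system default to compare against whatever window size you pick for `sequential_scan`.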