| In late 2021 I compared OS threads to io_uring for filesystem I/O at random-access reads from fast, NVMe SSDs. That measurement told me that it's not necessary to use io_uring for disk I/O performance for some workloads. It found no improvement in performance from io_uring, compared with a dynamic thread pool which tries to maintain enough I/O-blocked threads to keep the various kernel and device queues busy enough. This was a little surprising, because the read-syscall overhead when using threads was measurable. preadv2() was surprisingly much slower than pread(), so I used the latter. I used CLONE_IO and very small stacks for the I/O threads (less than a page; about 1kiB IIRC), but the performance was pretty good using only pthreads without those thread optimisations. Probably I had a good thread pool and queue logic, as it surprised me that the result was much faster than "fio" banchmark results had led me to expect. In principle, io_uring should be a little more robust to different scenarios with competing processes, compared with blocking I/O threads, because it has access to kernel scheduling in a way that userspace does not. I also expect io_uring to get a little faster with time, compared with the kernel I tested on. However, on Linux, OS threads* have been the fastest way to do filesystem and block-device I/O for a long time. (* except for CLONE_IO not being set by default, but that flag is ignored in most configurations in current kernels), |
Interesting, didn't realize the kernel would let you do that. I guess it makes sense since it's up to user space to map pages for the stack. The kernel doesn't have much to do on clone except set the stack pointer.