If you saturate the submission queue with CPU-bottlenecked tasks, it defeats the value-add of io_uring - at that point, you might as well replace your kernel-space thread pool with a user-space one.
Sure, but that approach forces you to consider/research just how much CPU your I/O tasks may or may not require. What if I'm not sure? How CPU-intensive is open()? What about close()? What about read()?
It would simplify my design process if I could count on io_uring being optimal for ~all I/O tasks, rather than having to treat "CPU-heavy I/O" and "CPU-light I/O" as two separate things that require two separate designs.
This is something that will require profiling to get exact numbers. The non-async portions of a high level filesystem read operation appear rather trivial: checking for cache hits (page cache, dentry cache, etc), parsing the inode/dentry info, and the memcpy to userspace. I wouldn't worry about any of these starving subsequent io_uring SQEs.
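For anyone who wants rough numbers before reaching for a real profiler, here is a minimal sketch that times open(), read() and close() on a small file that should already be in the page cache. With no device wait involved, wall-clock time per call is a crude proxy for the synchronous CPU cost. The helper names (`time_syscalls`, `make_cached_file`), the iteration count and the 4 KiB read size are my own assumptions, and the figures include Python interpreter overhead, so treat them as an upper bound rather than a kernel measurement.

```python
import os
import tempfile
import time


def time_syscalls(path, iterations=2000, read_size=4096):
    """Return average wall-clock nanoseconds per open/read/close call on `path`."""
    totals = {"open": 0, "read": 0, "close": 0}
    for _ in range(iterations):
        t0 = time.perf_counter_ns()
        fd = os.open(path, os.O_RDONLY)
        t1 = time.perf_counter_ns()
        os.read(fd, read_size)
        t2 = time.perf_counter_ns()
        os.close(fd)
        t3 = time.perf_counter_ns()
        totals["open"] += t1 - t0
        totals["read"] += t2 - t1
        totals["close"] += t3 - t2
    return {name: total / iterations for name, total in totals.items()}


def make_cached_file(size=4096):
    """Create a small temp file (hypothetical helper).

    Assumption: having just been written, its pages are still in the page cache.
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * size)
    return path


if __name__ == "__main__":
    path = make_cached_file()
    try:
        for name, ns in time_syscalls(path).items():
            print(f"{name}: {ns:.0f} ns/call")
    finally:
        os.unlink(path)
```

This only covers the cache-hit path on whatever filesystem backs the temp directory. Anything that goes through dm-crypt, software RAID or a cold cache needs a proper profiler to separate CPU time from device wait.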
I reckon the most likely place you'd find unexpected CPU-heavy work is at the block layer. Software RAID and dm-crypt will burn plenty of cycles, enough to qualify as exceptions to the "no FPU instructions in the kernel" guideline.
> Software RAID and dm-crypt will burn plenty of cycles, enough to qualify as exceptions to the "no FPU instructions in the kernel" guideline.
LUKS has a negligible impact on I/O bandwidth, and the same is true for software RAID. I'm almost saturating NVMe drives using a combination of LUKS (aes-xts) and software RAID. Additionally, the encryption and decryption processes are almost free when using hardware AES-NI instructions, especially while waiting for I/O.
Agreed that you are deep into "you need to try & figure out" territory. The abstract theorycrafting has dug too deep; there are no good answers to such questions at this stage.
> The non-async portions of a high level filesystem read operation appear rather trivial: checking for cache hits (page cache, dentry cache, etc), parsing the inode/dentry info, and the memcpy to userspace.
Worth maybe pointing out the slick work ExtFUSE has done to make in-kernel eBPF a capable way to do a lot of basic fs stuff. That userland can send in eBPF kernel programs to run various fs tasks is pretty cool flexibility, and this work has shown colossal gains by letting these formerly FUSE filesystems-in-userland author and send up their own eBPF to handle various of these responsibilities, but now in the kernel.
https://github.com/extfuse/extfuse
Very much agreeing again though. Although the CF article highlights extremes, there's really a toolkit described for building io_uring processing as you'd like, shaping how many kernel threads and many other parameters as you please. People keep asking for specifics of how things work, but the answer keeps being that it depends on how you opt to use it.
> It would simplify my design process if I could count on io_uring being optimal for ~all I/O tasks, rather than having to treat "CPU-heavy I/O" and "CPU-light I/O" as two separate things that require two separate designs.