> Provocative thought in that case would then be - in what % are io_uring improvements (over libaio) undermined by the device behavior (firmware) in mixed workloads. That % could range from noticeably to almost nothing so it might very well affect the experiment conclusion.
That’s absolutely fair. Also, it would be useful to test across different devices, since their behavior can vary significantly, especially when preconditioned or under corner-case workloads. In our case, we focused on scenarios typical for YDB deployments, so we didn’t extend the study further. That said, we believe the observed trends are fairly general.
> For example, if one is posing the question if switching to io_uring is worth it, I could definitely see different outcomes of that experiment in mixed workloads per observations that you described.
I agree that for mixed workloads the outcome may differ. However, for us the primary concern in the AIO vs io_uring comparison is syscall behavior. It is critical that submission does not block unpredictably. Even without polling, io_uring shows consistently better latency across the full range of iodepths. If device latency dominates (as in your scenario), the relative benefit may shrink, but a faster submission path still helps drive higher effective queue depth and utilize the device better.
> This is also interesting. May I ask why is that the case? Are you able to saturate the NVMe disk just with a single thread? I assume not but you may be using some particular workloads and/or avoiding kernel that makes this possible.
The component we are working on is designed for write-intensive workloads. Due to DWPD constraints, we intentionally limit sustained write throughput to what the device can safely handle over its lifetime. In practice, this is often on the order of ~200–300 MB/s, which a single thread can easily saturate.
At the same time, we care a lot about burst behavior. With AIO, we observed poor predictability: total latency depends heavily on how requests are submitted (especially with batching), and syscall time can grow proportionally to batch size * event count. io_uring largely eliminates this issue by decoupling submission from syscalls and providing a much more stable submission path. Additionally, for bursty workloads we can use SQPOLL + IOPOLL to further reduce latency in specialized setups.
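
For readers wondering how a DWPD rating turns into a sustained write budget like the ~200–300 MB/s mentioned above, here is a rough back-of-the-envelope sketch. The drive capacities and DWPD ratings are hypothetical examples, not numbers from the thread, and the calculation ignores write amplification and any safety margin an operator might add.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_write_mb_per_s(capacity_tb: float, dwpd: float) -> float:
    """Average write rate (decimal MB/s) that uses up exactly the rated DWPD.

    capacity_tb is the drive capacity in decimal terabytes; dwpd is the
    rated number of full drive writes per day.
    """
    bytes_per_day = capacity_tb * 1e12 * dwpd
    return bytes_per_day / SECONDS_PER_DAY / 1e6


# Hypothetical drives, chosen only to illustrate the arithmetic.
for capacity_tb, dwpd in [(3.84, 1.0), (7.68, 3.0)]:
    rate = sustained_write_mb_per_s(capacity_tb, dwpd)
    print(f"{capacity_tb} TB at {dwpd} DWPD -> {rate:.1f} MB/s sustained")
```

Under these assumptions a 7.68 TB drive rated at 3 DWPD lands in the few-hundred-MB/s range, which is consistent with a budget that a single thread can saturate.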
Agreed. From first-hand experience I know how painful this is, and how proving or disproving a hypothesis about a certain wheel in the system can turn into a crazy rabbit hole, especially in infrastructure software, which, due to ever-increasing data volume (and its distribution), pushes both software and hardware to their limits.
I used to test my algos across a wide range of HW I had access to. It included "slow" HDDs and "fast" NVMe disks, even Optane, small and large amounts of RAM, slow and fast CPUs, different cache sizes and topologies, NUMA vs. non-NUMA, etc. This was because the software I developed didn't have the luxury of running in a fully controlled SW/HW environment, so I had to make sure it ran well across different configurations, and even across operating systems, microarchitectures, etc.
And it was a challenge to separate the signal from the noise, given how many experiments one had to run and how volatile (stateful) our HW really is, on top of all the non-determinism imposed by the software (database kernel + operating system kernel).
> In our case, we focused on scenarios typical for YDB deployments, so we didn’t extend the study further.
Yes, that is fair enough and basically the only thing that matters. Solving the problem at hand rather than a "general" problem has been the most successful strategy for me as well.
> It is critical that submission does not block unpredictably. Even without polling, io_uring shows consistently better latency across the full range of iodepths. If device latency dominates (as in your scenario), the relative benefit may shrink, but a faster submission path still helps drive higher effective queue depth and utilize the device better.
Yes, I would readily agree that io_uring is in general a better design. It is a C++ executors-like design, but in the kernel itself, and pretty advanced from what I could tell the last time I delved into the implementation details (~2 years ago). Having developed an executor-like (userspace) library myself, I figure that in more extreme cases one would want to gain total control of IO scheduling and processing. This is an exercise I would like to do at some point.
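
The queue-depth point in the quote above can be made concrete with Little's law (in-flight requests = submission rate × time each request stays in flight). A minimal sketch, assuming a single submitting thread that pays a fixed CPU cost per request and a fixed device latency. All numbers are hypothetical, not measurements from this thread.

```python
def max_in_flight(device_latency_us: float,
                  submit_cost_us: float,
                  target_depth: int) -> float:
    """Upper bound on requests one thread can keep in flight.

    The thread can submit at most 1 / submit_cost_us requests per
    microsecond, and each request stays in flight for device_latency_us,
    so by Little's law at most device_latency_us / submit_cost_us
    requests overlap, capped by the depth the application asks for.
    """
    if submit_cost_us <= 0:
        return float(target_depth)
    return min(float(target_depth), device_latency_us / submit_cost_us)


# Hypothetical: 80 us device latency, target iodepth 32.
for cost_us in (5.0, 1.0):
    depth = max_in_flight(80.0, cost_us, 32)
    print(f"submit cost {cost_us} us -> effective depth <= {depth:.0f}")
```

In this toy model, cutting the per-request submission cost is what lets the thread actually reach the depth it asked for, even when device latency dominates the total.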
> ... io_uring largely eliminates this issue by decoupling submission from syscalls and providing a much more stable submission path. Additionally, for bursty workloads we can use SQPOLL + IOPOLL to further reduce latency in specialized setups.
Thanks for sharing the details. I figured there was something peculiar about what you're doing. Quite interesting requirements.