| There are a couple of challenges with mixed read+write workloads on NVMe. In practice, read latency tends to degrade over time under mixed load. We observe this even across relatively short consecutive runs. To get meaningful results, you need to first drive the device into a steady state. In our case, however, we were primarily interested in software overhead rather than device behavior. For a cleaner comparison, it would probably make sense to use something like an in-memory block device (e.g., ublk), but we didn’t dig into it.
As for profiling: we didn’t run perf, so the following is my educated guess:
1. With libaio, control structures are copied as part of submission/completion. io_uring avoids some of this overhead via shared rings and pre-registered resources.
2. In our experience (in YDB), AIO syscall latency tends to be less predictable, even when well-tuned.
3. Although we report throughput, the setup is effectively latency-bound (single fio job). With more concurrency, libaio might catch up. We intentionally used a single job because we typically aim for one thread per disk (two at most if polling is enabled). In our setup (usually 6 disks), increasing concurrency per device is not desirable. |
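The latency-bound point can be sketched with Little's law (throughput = I/Os in flight / per-I/O latency). This is only a toy model: the per-I/O latencies and the device IOPS ceiling below are hypothetical values, not measurements, and `iops` is an illustrative helper rather than part of fio or any library.

```python
def iops(in_flight: int, latency_us: float, device_cap_iops: float) -> float:
    """Little's law estimate of throughput, clipped at an assumed device ceiling."""
    latency_bound = in_flight / (latency_us * 1e-6)
    return min(latency_bound, device_cap_iops)


DEVICE_CAP = 500_000.0  # hypothetical device limit, in IOPS
LIBAIO_US = 100.0       # hypothetical per-I/O latency with libaio
URING_US = 90.0         # hypothetical per-I/O latency with io_uring

for in_flight in (8, 32, 128):
    aio = iops(in_flight, LIBAIO_US, DEVICE_CAP)
    uring = iops(in_flight, URING_US, DEVICE_CAP)
    print(f"in flight {in_flight:4d}: libaio {aio:9.0f} IOPS, io_uring {uring:9.0f} IOPS")
```

With few I/Os in flight, the lower-latency path wins in proportion to its latency advantage; once enough I/Os are in flight to hit the assumed device ceiling, both paths report the same throughput, which is the "libaio might catch up" case.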
> In practice, read latency tends to degrade over time under mixed load. We observe this even across relatively short consecutive runs. To get meaningful results, you need to first drive the device into a steady state. In our case, however, we were primarily interested in software overhead rather than device behavior.
I see. A provocative thought in that case: to what extent are io_uring's improvements (over libaio) undermined by device (firmware) behavior in mixed workloads? That could range from noticeable to almost nothing, so it might very well affect the conclusion of the experiment.
For example, if the question is whether switching to io_uring is worth it, I could definitely see that experiment turning out differently under mixed workloads, given the observations you described.
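To make the question concrete, here is a minimal back-of-the-envelope sketch. It assumes per-I/O latency is simply software overhead plus device latency, which is a simplification, and every number in it is hypothetical rather than measured; `relative_gain` is an illustrative helper, not part of any tool.

```python
def relative_gain(sw_old_us: float, sw_new_us: float, device_us: float) -> float:
    """Fraction of end-to-end per-I/O latency saved by cutting software overhead.

    Assumes latency = software overhead + device latency (a simplification).
    """
    old_total = sw_old_us + device_us
    new_total = sw_new_us + device_us
    return (old_total - new_total) / old_total


# Hypothetical numbers: 6 us vs 3 us of software overhead per I/O,
# evaluated against increasingly slow (degraded) device latencies.
for device_us in (20.0, 100.0, 1000.0):
    gain = relative_gain(6.0, 3.0, device_us)
    print(f"device latency {device_us:7.1f} us -> end-to-end gain {gain:.1%}")
```

The same absolute saving becomes a smaller share of the total as device latency grows, which is the effect that degraded read latency under mixed load would have on the comparison.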
> For a cleaner comparison, it would probably make sense to use something like an in-memory block device (e.g., ublk), but we didn’t dig into it.
Yeah, but in that case you would be testing the limits of ublk performance, no? Also, it seems to be implemented on top of io_uring, AFAICS.
I have personally learned to run experiments, and draw conclusions from them, in an environment that is as close as possible to production. Otherwise, there's really no guarantee that behavior observed in env1 will be reproducible in, or correlate with, the behavior in env2. In this particular case, env1 could be a write-only workload, while env2 would be a mixed workload.
> We intentionally used a single job because we typically aim for one thread per disk (two at most if polling is enabled). In our setup (usually 6 disks), increasing concurrency per device is not desirable
This is also interesting. May I ask why that is the case? Are you able to saturate an NVMe disk with just a single thread? I assume not, but you may be using some particular workloads and/or avoiding the kernel, which would make this possible.