io_uring allows for async reads and writes to disk without forcing a thread pool or direct I/O. That alone makes it much more scalable for workloads that touch both the network and disk.
The point was that io_uring isn't going to make a big difference for the network code, since for disk I/O code (especially for the sorts of things GP is talking about) you have a bounded number of "threads" of execution anyway. For a node in a pub-sub system, maybe it has c10k users, but it's probably appending to a handful of LSM-like data structures that are written sequentially to disk. The biggest difference is random reads, but even then you can saturate what the disk will do with a double-digit number of threads.
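As a rough back-of-envelope check on the "double digit" claim, Little's law says the number of requests you need in flight is throughput times latency. The sketch below just does that arithmetic; the function name and the device figures are made up for illustration and are not measurements of any real disk.

```python
def required_queue_depth(target_iops: int, latency_us: int) -> int:
    """Little's law: requests in flight = throughput * latency.

    Integer math (latency in microseconds) avoids float rounding;
    the result is rounded up to a whole request.
    """
    return (target_iops * latency_us + 999_999) // 1_000_000


# Hypothetical device figures, for illustration only.
ssd_depth = required_queue_depth(target_iops=200_000, latency_us=100)
hdd_depth = required_queue_depth(target_iops=200, latency_us=8_000)

print("SSD-like device:", ssd_depth, "requests in flight")
print("HDD-like device:", hdd_depth, "requests in flight")
```

With blocking I/O, each in-flight request costs one thread, so under these assumed numbers a few dozen threads would already keep the device busy.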
No, most operations in the ring directly work asynchronously. The thread mechanism only exists as a fallback for combinations of operations and system configurations (e.g. filesystems) that don't support asynchronous operation.
I don't know anything about the internals of io_uring and am genuinely curious how it works. Saying it "directly works asynchronously" doesn't mean anything, though. When circular buffer requests are processed, what thread is processing the request, how is that thread managed, and how does it manage blocking/unblocking when communicating with the storage device?
Internally, many parts of the Linux kernel operate asynchronously: they queue up a request with some subsystem (e.g. a hardware device), and get an event delivered when the request is completed. In such cases, io_uring can enqueue such a request, and complete it when receiving the event, without needing to use a thread to block waiting for it.
See, for instance, https://lpc.events/event/11/contributions/901/attachments/78... slide 5 (though more has happened since then). io_uring will first see if it has everything needed to do the operation immediately; if not, it'll queue a request where the operation supports that (e.g. direct I/O, or buffered I/O in some cases). The thread pool is the last fallback, which always works if nothing else does.
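To make that three-step ladder concrete, here is a toy user-space model of it: try inline, else queue a native async request and complete on an event, else fall back to a worker thread. This is only an illustration of the ordering described above, not how the kernel is written; `Request`, `FakeDevice`, `submit`, and `demo` are all hypothetical names invented for the sketch.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    name: str
    # Returns data if the operation can finish right now, else None.
    try_inline: Callable[[], Optional[bytes]]
    # Starts a native async operation and calls back on completion,
    # or None if this operation has no native async path.
    start_async: Optional[Callable[[Callable[[bytes], None]], None]]
    # Always works, but may block the calling thread.
    blocking: Callable[[], bytes]


class FakeDevice:
    """Stand-in for a subsystem that signals completion via events."""

    def __init__(self) -> None:
        self.pending = []

    def enqueue(self, payload: bytes, on_done: Callable[[bytes], None]) -> None:
        self.pending.append((payload, on_done))

    def fire_completions(self) -> None:
        for payload, on_done in self.pending:
            on_done(payload)
        self.pending.clear()


def submit(req: Request, pool: ThreadPoolExecutor):
    """Return (path taken, future) following the inline/async/worker order."""
    fut: Future = Future()
    data = req.try_inline()
    if data is not None:  # 1. everything was ready, complete immediately
        fut.set_result(data)
        return "inline", fut
    if req.start_async is not None:  # 2. queue it, complete on the event
        req.start_async(fut.set_result)
        return "async", fut
    return "worker", pool.submit(req.blocking)  # 3. last-resort thread pool


def demo():
    dev = FakeDevice()
    requests = [
        Request("cached read", lambda: b"hit", None, lambda: b"hit"),
        Request("direct read", lambda: None,
                lambda cb: dev.enqueue(b"dma", cb), lambda: b"dma"),
        Request("unsupported fs", lambda: None, None, lambda: b"slow"),
    ]
    with ThreadPoolExecutor(max_workers=2) as pool:
        submitted = [submit(r, pool) for r in requests]
        dev.fire_completions()  # the "event" that completes the async request
        return [(path, fut.result()) for path, fut in submitted]


for path, data in demo():
    print(path, data)
```

Only the third request ever occupies a thread in this model; the first two complete without anything blocking on them, which is the distinction the comments above are drawing.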
Is it safe to say that a single thread using io_uring should be as fast or faster than N threads performing the same set of I/O tasks in a blocking manner?
In other words, can you count on the kernel to use its own threads internally whenever an I/O task might actually need to use a lot of CPU?
If you saturate the submission queue with CPU-bottlenecked tasks, it defeats the value-add of io_uring - at that point, you might as well replace your kernel-space thread pool with a user-space one.