by JoshTriplett 683 days ago
No, most operations in the ring directly work asynchronously. The thread mechanism only exists as a fallback for combinations of operations and system configurations (e.g. filesystems) that don't support asynchronous operation.

I don't know anything about the internals of io_uring and am genuinely curious how it works. Saying it "directly works asynchronously" doesn't tell me much, though. When requests in the ring buffer are processed, what thread is processing them, how is that thread managed, and how does it handle blocking/unblocking when communicating with the storage device?
Internally, many parts of the Linux kernel operate asynchronously: they queue up a request with some subsystem (e.g. a hardware device), and get an event delivered when the request is completed. In such cases, io_uring can enqueue such a request, and complete it when receiving the event, without needing to use a thread to block waiting for it.
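To make the "enqueue, then complete on the event" pattern concrete, here is a toy Python model. It is not kernel code: `ToyDevice`, `submit`, `raise_completion_events`, and `post_completion` are all hypothetical names invented for illustration. The only point it shows is that submission returns immediately and the completion entry is produced by the event callback, so no thread sits blocked on the request.

```python
from collections import deque


class ToyDevice:
    """Hypothetical device: accepts requests and later raises completion events."""

    def __init__(self):
        self.in_flight = deque()

    def submit(self, request_id, on_complete):
        # Queue the request and return immediately; nothing blocks here.
        self.in_flight.append((request_id, on_complete))

    def raise_completion_events(self):
        # Stands in for the interrupt path: one callback per finished request.
        while self.in_flight:
            request_id, on_complete = self.in_flight.popleft()
            on_complete(request_id, f"data-for-{request_id}")


# Stands in for the completion queue that userspace would read from.
completion_queue = []


def post_completion(request_id, result):
    # Runs in the event context, not in a thread that waited for the request.
    completion_queue.append((request_id, result))


device = ToyDevice()
for request_id in (1, 2, 3):
    device.submit(request_id, post_completion)

# All three requests are in flight and nothing has completed yet.
pending_before_events = list(completion_queue)

device.raise_completion_events()
print(completion_queue)
```

In the real kernel the "event" is typically an interrupt or a callback from the block layer, and the details differ per subsystem; the sketch only mirrors the control flow described above.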

See, for instance, https://lpc.events/event/11/contributions/901/attachments/78... slide 5 (though more has happened since then). io_uring will first check whether it has everything needed to do the operation immediately. If not, it will queue an asynchronous request where that is supported (e.g. direct I/O, or buffered I/O in some cases). The thread pool is the last fallback, which always works if nothing else does.

https://lwn.net/Articles/821274/ talks about making async buffered reads work, for instance.
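The order described above (try immediately, then queue an async request, then fall back to a thread pool) can be sketched as a small decision function. This is a simplified model under stated assumptions, not io_uring's actual logic: the `dispatch` function and the `ready_now` / `supports_async` / `run` keys are hypothetical, and Python's `ThreadPoolExecutor` merely stands in for the kernel's worker threads.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch(op, pool, async_queue):
    """Toy three-step dispatch; `op` is a dict with hypothetical keys."""
    if op["ready_now"]:
        # Step 1: everything needed is already available (e.g. a cache hit).
        return "inline", op["run"]()
    if op["supports_async"]:
        # Step 2: hand the request to a subsystem that completes it via an event.
        async_queue.append(op)
        return "queued", None
    # Step 3: last resort, a worker thread blocks on behalf of the caller.
    return "worker", pool.submit(op["run"])


async_queue = []
with ThreadPoolExecutor(max_workers=2) as pool:
    cached_read = {"ready_now": True, "supports_async": True, "run": lambda: "cached"}
    direct_read = {"ready_now": False, "supports_async": True, "run": lambda: "from-disk"}
    odd_fs_read = {"ready_now": False, "supports_async": False, "run": lambda: "blocking"}

    results = [dispatch(op, pool, async_queue)
               for op in (cached_read, direct_read, odd_fs_read)]
    paths = [path for path, _ in results]
    worker_result = results[2][1].result()

print(paths)
```

Only the third operation ever occupies a thread, which is the sense in which the thread pool is a fallback rather than the main mechanism.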

Is it safe to say that a single thread using io_uring should be as fast as or faster than N threads performing the same set of I/O tasks in a blocking manner?

In other words, can you count on the kernel to use its own threads internally whenever an I/O task might actually need to use a lot of CPU?

If you saturate the submission queue with CPU-bottlenecked tasks, it defeats the value-add of io_uring - at that point, you might as well replace your kernel-space thread pool with a user-space one.
Sure, but that approach forces you to consider/research just how much CPU your I/O tasks may or may not require. What if I'm not sure? How CPU-intensive is open()? What about close()? What about read()?

It would simplify my design process if I could count on io_uring being optimal for ~all I/O tasks, rather than having to treat "CPU-heavy I/O" and "CPU-light I/O" as two separate things that require two separate designs.

This is something that will require profiling to get exact numbers. The non-async portions of a high level filesystem read operation appear rather trivial: checking for cache hits (page cache, dentry cache, etc), parsing the inode/dentry info, and the memcpy to userspace. I wouldn't worry about any of these starving subsequent io_uring SQEs.

I reckon the most likely place you'd find unexpected CPU-heavy work is at the block layer. Software RAID and dm-crypt will burn plenty of cycles, enough to count as exceptions to the "no FPU instructions in the kernel" guideline.
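For the profiling suggested above, a rough first pass could look like the following Python sketch. It measures CPU time (user plus system) per call for open()+close() and for a cached 4 KiB read, using the ordinary blocking syscalls rather than io_uring. The helper `cpu_ns_per_call` and the iteration count are my own choices, the numbers include Python interpreter overhead, and `os.pread` assumes a Unix-like system, so treat the results as an upper bound rather than a precise kernel cost.

```python
import os
import tempfile
import time


def cpu_ns_per_call(fn, iterations=2000):
    """Average CPU time (user + system) per call, in nanoseconds."""
    start = time.process_time_ns()
    for _ in range(iterations):
        fn()
    return (time.process_time_ns() - start) / iterations


# Create a small scratch file; it will sit in the page cache after the write.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096)
    path = tmp.name


def open_close():
    os.close(os.open(path, os.O_RDONLY))


fd = os.open(path, os.O_RDONLY)


def read_4k():
    # pread avoids moving the file offset, so every call reads the same block.
    os.pread(fd, 4096, 0)


costs = {
    "open+close": cpu_ns_per_call(open_close),
    "pread 4 KiB (cached)": cpu_ns_per_call(read_4k),
}

os.close(fd)
os.unlink(path)

for name, ns in costs.items():
    print(f"{name}: ~{ns:.0f} ns CPU per call (includes Python overhead)")
```

Running the same measurement on a file that lives on software RAID or dm-crypt, with the page cache dropped, would be one way to check whether the block layer is where the cycles go on a given system.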