|
|
|
|
|
by saghm
2 hours ago
|
|
Yeah, the explanation that I've usually heard for this sort of thing is that it's intended to get back CPU time that's lost when too many system threads are blocking to keep something on every core even during I/O (or pay for it in terms of the context switching overhead if you compensate for this with an extremely large number of system threads). The theory is that you'll avoid idle CPU compared to the common "one thread per core" way of doing things due to some of them being idle during I/O, at the cost of using some extra CPU to handle more things in user space. Obviously how much this helps can vary between use cases, but the measure of how much it's helping (or if it's maybe not helping at all!) is throughput, not CPU utilization. |
|
Specifically, not thread per core code has the following issues:
* you have to use atomics/locks to synchronize data access. This involves expensive HW operations to implement the semantics of what an atomic operation is
* you have to deal with lock contention and cache contention
* when an OS migrates the core that is executing your code, you’ve suddenly got cold caches all over the place (icache, dcache, and TLB).
There’s also a bunch of related things that pop up - even if you do thread per core, the processor interrupts for events probably land on a different CPU resulting in extra overhead within the OS to deliver the event to you.
Io_uring doesn’t “handle more things in user space”. It specifically avoids a bunch of overheads; you’re context switching less (other cores can execute the OS code to process your request) and you can pipeline I/O (you can tell the OS “do IO A, then B, then C and tell me when that’s all done”) and you get fewer memory copies (the kernel reads into your buffer directly without needing to create another copy although this is more nuanced).
Anyway, the better mental model is specifically io_uring is more efficient and thus CPUs spend less time standing around waiting for things to happen at the hardware level (context switching, waiting for locks, etc). If the CPUs weren’t actually spending much time waiting, then you don’t get much benefit. This is the same phenomenon as Jevons paradox in economics; IO gets cheaper so you can do more of it within a given time unit and thus your CPUs end up more often having real work to do.