by benlwalker 1156 days ago
I've spent essentially the last year trying to find the best way to use io_uring for networking inside the NVMe-oF target in SPDK. Many of my initial attempts were also slower than our heavily optimized epoll version. But now I feel like I'm getting somewhere and I'm starting to see the big gains. I plan to blog a bit about the optimal way to use it, but the key concepts seem to be:

1) create one io_uring per thread (much like you'd create one epoll group)

2) use the provided buffer mechanism to post a pool of large buffers to an io_uring. Bonus points for the newer ring-based version.

3) always keep a large (128k) async multishot recv posted to every socket in your set

4) as recvs complete, append the next "segment" of the stream to a per-socket list.

5) parse the protocol stream. As you make it through each segment, return it to the pool*

6) aggressively batch data to be sent. You can only have one send outstanding at a time per socket, so make it a big vectored write. Writes are only outstanding until they're confirmed queued in your local kernel, so it is a fairly short time until you can submit more, but it's worth batching into a single larger operation.
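To make step 6 concrete, here is a minimal sketch of the bookkeeping only, in Python rather than real io_uring calls. `SendBatcher` and the `submit_vectored` hook are hypothetical names invented for illustration; the hook stands in for whatever queues a single vectored send on the ring, and `on_send_complete` is assumed to be called when its completion arrives.

```python
from collections import deque
from typing import Callable, Deque, List


class SendBatcher:
    """Allow one in-flight vectored write per socket and batch everything else."""

    def __init__(self, submit_vectored: Callable[[List[bytes]], None]) -> None:
        # Hypothetical hook: stands in for queuing one vectored send on the ring.
        self._submit = submit_vectored
        self._pending: Deque[bytes] = deque()
        self._in_flight = False

    def send(self, data: bytes) -> None:
        """Queue data; it goes out now if nothing is in flight, otherwise later."""
        self._pending.append(data)
        self._flush()

    def on_send_complete(self) -> None:
        """Call when the previous write has completed; flushes whatever piled up."""
        self._in_flight = False
        self._flush()

    def _flush(self) -> None:
        if self._in_flight or not self._pending:
            return
        batch = list(self._pending)  # everything queued so far becomes one write
        self._pending.clear()
        self._in_flight = True
        self._submit(batch)


# Demo: record each vectored write that would have been submitted.
submitted: List[List[bytes]] = []
batcher = SendBatcher(submitted.append)
batcher.send(b"a")          # nothing in flight, so this is submitted alone
batcher.send(b"b")          # a write is in flight, so this waits
batcher.send(b"c")          # so does this
batcher.on_send_complete()  # b"b" and b"c" leave together as one vectored write
```

A real event loop would probably defer the flush to the end of each poll iteration so that even the first write picks up more data; this sketch flushes eagerly to stay short.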

* If you need part of the stream to live for an extended period of time, as we do for the payloads in NVMe-oF, build scatter-gather lists that point into the segments of the stream and maintain a reference count on each segment. Return a segment to the pool when its count drops to zero.
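The segment handling in steps 4 and 5 and in the footnote can be modeled roughly as below. This is a sketch of the reference-counting idea only, not SPDK's code and not the io_uring provided-buffer API: `Segment`, `SegmentPool`, and `build_sgl` are hypothetical names, and `take` simply pretends a recv completed into a pooled buffer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    buf: bytearray
    length: int = 0  # bytes filled in by the (modeled) recv
    refs: int = 0    # holders: the per-socket stream list plus any live SGLs


class SegmentPool:
    def __init__(self, count: int, size: int) -> None:
        self.free: List[Segment] = [Segment(bytearray(size)) for _ in range(count)]

    def take(self, data: bytes) -> Segment:
        """Model a recv completing into a pooled buffer."""
        seg = self.free.pop()
        if len(data) > len(seg.buf):
            raise ValueError("data larger than segment")
        seg.buf[:len(data)] = data
        seg.length = len(data)
        seg.refs = 1  # held by the per-socket stream list
        return seg

    def ref(self, seg: Segment) -> None:
        seg.refs += 1

    def unref(self, seg: Segment) -> None:
        seg.refs -= 1
        if seg.refs == 0:
            self.free.append(seg)  # last holder gone: back to the pool


def build_sgl(pool: SegmentPool, stream: List[Segment],
              offset: int, length: int) -> List[Tuple[Segment, int, int]]:
    """Return (segment, start, end) entries covering `length` bytes at `offset`."""
    sgl = []
    pos = 0  # stream offset of the current segment's first byte
    for seg in stream:
        start = max(offset, pos) - pos
        end = min(offset + length, pos + seg.length) - pos
        if start < end:
            pool.ref(seg)  # the SGL keeps this segment alive
            sgl.append((seg, start, end))
        pos += seg.length
    return sgl


pool = SegmentPool(count=2, size=16)
stream = [pool.take(b"hello "), pool.take(b"world!")]
sgl = build_sgl(pool, stream, offset=3, length=6)  # payload spans both segments

for seg in stream:  # the parser is done with the stream itself
    pool.unref(seg)
free_while_payload_live = len(pool.free)  # both segments still pinned by the SGL

payload = b"".join(bytes(seg.buf[a:b]) for seg, a, b in sgl)
for seg, _, _ in sgl:  # payload consumed: drop the SGL's references
    pool.unref(seg)
```

The point of the design is that the parser can release segments as soon as it has walked past them, while a long-lived payload keeps only the segments it actually touches out of the pool.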

Everyone knows the best way to use epoll at this point. Few of us have really figured out io_uring. But that doesn't mean it is slower.

4 comments

> Few of us have really figured out io_uring. But that doesn't mean it is slower.

seastar.io is a high level framework that I believe has "figured out" io_uring, with additional caveats the framework imposes (which is honestly freeing).

Additionally the rust equivalent: https://github.com/DataDog/glommio

seastar uses DPDK, and spdk uses it too. the OP is a maintainer on that.
For network, I believe; io_uring is used for disk access, though.
It's also worth noting that io_uring has had at most 10-15 engineer-years worth of performance tuning vs. the many (?) hundreds of years that epoll has received. I work with Jens, Pavel, and others and can confidently say that low-queue-depth perf parity with epoll is an important goal of the effort.

As an aside, it's great to see high praise from an spdk maintainer. One of the big reasons for doing io_uring in the first place was that it was impossible to compete in terms of performance with total bypass unless you changed the syscall approach.

I'd be very interested to read that blog post. Besides your tips for maximum performance, I'm curious about the minimum you have to do to get a significant improvement. I can easily imagine someone basically using it to poll for readiness like epoll and being disappointed. But if that's enough to benefit, I'd be surprised and intrigued. More likely you need to actually use it to enqueue the op, but folks have struggled with ownership. Is doing that in a not quite optimal way (extra copies on the user side) enough? Or do you need to optimize those away? Do you need to do the buffer pooling and/or multishot stuff?

Do fixed buffers help for network I/O? In August 2022 @axboe said "No benefits for fixed buffers with sockets right now, this will change at some point."

To address my own silly questions, yes, one should use the new fixed buffers described in this document: https://github.com/axboe/liburing/wiki/io_uring-and-networki...