Hacker News new | ask | show | jobs
by tele_ski 1824 days ago
What you're describing sounds awesome, I hadn't thought about being able to string syscall commands together like that. I wonder how well that will work in practice? Is there a way to be notified if one of the commands in the sequence fails like for instance the buffer wasn't large enough to write all the incoming data into?
4 comments

According to a relevant manpage[0]:

> Only members inside the chain are serialized. A chain of SQEs will be broken, if any request in that chain ends in error. io_uring considers any unexpected result an error. This means that, eg, a short read will also terminate the remainder of the chain. If a chain of SQE links is broken, the remaining unstarted part of the chain will be terminated and completed with -ECANCELED as the error code.

So it sounds like you would need to decide what your strategy is. It sounds like you can inspect the step in the sequence that had the error, learn what the error was, and decide whether you want to re-issue the command that failed along with the remainder of the sequence. For a short read, you should still have access to the bytes that were read, so you're not losing information due to the error.

There is an alternative "hardlink" concept that will continue the command sequence even in the presence of an error in the previous step, like a short read, as long as the previous step was correctly submitted.

Error handling gets in the way of some of the fun, as usual, but it is important to think about.

[0]: https://manpages.debian.org/unstable/liburing-dev/io_uring_e...

I'm looking at the evolution in the chaining capabilities of io_uring. Right now it's a bit basic but I'm guessing in 5 or 6 kernel versions people will have built a micro kernel or a web server just by chaining things in io_uring and maybe some custom chaining/decision blocks in ebpf :-)
BPF, you say? https://lwn.net/Articles/847951/

> The obvious place where BPF can add value is making decisions based on the outcome of previous operations in the ring. Currently, these decisions must be made in user space, which involves potential delays as the relevant process is scheduled and run. Instead, when an operation completes, a BPF program might be able to decide what to do next without ever leaving the kernel. "What to do next" could include submitting more I/O operations, moving on to the next in a series of files to process, or aborting a series of commands if something unexpected happens.

Yes exactly what I had in mind. I'm also thinking of a particular chain of syscalls [0][1][2][3] (send netlink message, setsockopt, ioctls, getsockopts, reads, then setsockopt, then send netlink message) grouped so as to be done in one sequence without ever surfacing up to userland (just fill those here buffers, who's a good boy!). So now I'm missing ioctls and getsockopts but all in good time!

[0] https://github.com/checkpoint-restore/criu/blob/7686b939d155...

[1] https://github.com/checkpoint-restore/criu/blob/7686b939d155...

[2] https://github.com/checkpoint-restore/criu/blob/7686b939d155...

[3] https://www.infradead.org/~tgr/libnl/doc/api/group__qdisc__p...

BPF is going to change so many things... At the moment I'm having lots of trouble with the tooling but hey, let's just write BPF bytecode by hand or with a macro-asm. Reduce the ambitions...
Also wondering whether we should rethink language runtimes for this. Like write everything in SPARK (so all specs are checked), target bpf bytecode through gnatllvm. OK you've written the equivalent of a cuda kernel or tbb::flow block. Now for the chaining y'all have this toolbox of task-chainers (barriers, priority queues, routers...) and you'll never even enter userland? I'm thinking /many/ programs could be described as such.
One obvious limit is that usually chaining the open op is not very straightforward as following commands tend to need the fd.

But to answer your actual question, you can link requests and either abort or continue on failure.

Yes, check the documentation for the IOSQE_IO_LINK flag to see exactly how this works.