Hacker News new | ask | show | jobs
by geofft 2232 days ago
One thing this writeup made me realize is, if I have a misbehaving I/O system (NFS or remote block device over a flaky network, dying SSD, etc.), in the pre-io_uring world I'd probably see that via /proc/$pid/stack pretty clearly - I'd see a stack with the read syscall, then the particular I/O subsystem, then the physical implementation of that subsystem. Or if I looked at /proc/$pid/syscall I'd see a read call on a certain fd, and I could look in /proc/$pid/fd/ and see which fd it was and where it lived.

However, in the post-io_uring world, I think I won't see that, right? If I understand right, I'll at most see a call to io_uring_enter, and maybe not even that.

How do I tell what a stuck io_uring-using program is stuck on? Is there a way I can see all the pending I/Os and what's going on with them?

How is this implemented internally - does it expand into one kernel thread per I/O, or something? (I guess, if you had a silly filesystem which spent 5 seconds in TASK_UNINTERRUPTIBLE on each read, and you used io_uring to submit 100 reads from it, what actually happens?)

6 comments

I think that's a very reasonable concern. It however isn't really about io_uring - it applies to all "async" solutions. Even today if you are running async IO in userspace (e.g. using epoll), it's not very obvious where something went wrong, because no task is seemingly blocked. If you attach a debugger, you might most likely see something being blocked on epoll - but a callstack to the problematic application code is nowhere in sight.

Even if pause execution while inside the application code there might not be a great stack which contains all relevant data. It will only contain the information since the last task resumption (e.g. through a callback). Depending on your solution (C callbacks, C++ closures, C# or Kotlin async/await, Rust async/await) the information will be between not very helpful and somewhat understandable, but never on par with a synchronous call.

> Even today if you are running async IO in userspace (e.g. using epoll), it's not very obvious where something went wrong, because no task is seemingly blocked.

It doesn't apply to file IO, which is never non-blocking, and can't be made async with epoll. Epoll always considers files ready for any IO. And if the device is slow, the thread is blocked with dreaded "D" state.

The fundamental problem is that readiness based async IO and random access to not mix well. You'd need a way to poll readiness for different positions in the same file at the same time.

Completion based async (including io_uring on Linux or IO completion ports on Windows) doesn't suffer from this problem.

> It will only contain the information since the last task resumption

That's an implementation detail though. As far as I'm aware python keeps hold of the stack, so it outputs complete stack traces as you'd expect from synchronous code.

You would want to start using the more modern debugging tools, namely dynamic tracing tools like bpftrace[1]. Though in fairness, it might be a tad tricky to get a trace for a specific file without some more complicated scripts.

[1]: https://github.com/iovisor/bpftrace

This is such a great point. Never thought how async I/O could be a problem this way. In the SQ polling example, I used BPF to "prove" that the process does not make system calls:

https://unixism.net/loti/tutorial/sq_poll.html

Could be a good idea to use BPF to expose what io_uring is doing. Just a wild thought.

Good point. Would be great if the submission and completion ring buffers were accessible via procfs.
Could eBPF be used? I'm really not sure myself.
Use timeouts?
How exactly? I/O in TASK_UNINTERRUPTIBLE/TASK_KILLABLE cannot be timed out - so part of my question is how io_uring handles that in general.
If it's just blocked, you could probably look at the io_uring kthreads. But as I mentioned in another comment, bpftrace is probably a more useful tool for things like this (and it's useful for general kernel debugging too!).