
by geofft 2232 days ago

One thing this writeup made me realize is, if I have a misbehaving I/O system (NFS or remote block device over a flaky network, dying SSD, etc.), in the pre-io_uring world I'd probably see that via /proc/$pid/stack pretty clearly - I'd see a stack with the read syscall, then the particular I/O subsystem, then the physical implementation of that subsystem. Or if I looked at /proc/$pid/syscall I'd see a read call on a certain fd, and I could look in /proc/$pid/fd/ and see which fd it was and where it lived.
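For reference, the manual check described above can be scripted. This is only a sketch: `parse_syscall_line` and `blocked_fd_target` are hypothetical helper names, it assumes the usual `/proc/<pid>/syscall` layout (decimal syscall number followed by hexadecimal arguments), and it blindly treats the first argument as a file descriptor, which is only true for calls like read or write.

```python
import os

def parse_syscall_line(line):
    """Parse one line of /proc/<pid>/syscall.

    Returns None when the task is not inside a syscall ("running" or "-1 ...").
    Otherwise returns (syscall_number, [up to six argument values]).
    """
    fields = line.split()
    if not fields or fields[0] in ("running", "-1"):
        return None
    number = int(fields[0])                    # syscall number is decimal
    args = [int(f, 16) for f in fields[1:7]]   # arguments are hex (0x...)
    return number, args

def blocked_fd_target(pid):
    """Hypothetical helper: resolve the first syscall argument as an fd.

    Assumption: the blocked syscall takes an fd as its first argument.
    Returns the symlink target from /proc/<pid>/fd/, or None.
    """
    with open(f"/proc/{pid}/syscall") as f:
        parsed = parse_syscall_line(f.read())
    if parsed is None:
        return None
    _number, args = parsed
    try:
        return os.readlink(f"/proc/{pid}/fd/{args[0]}")
    except (OSError, IndexError):
        return None
```

Reading another process's `/proc/<pid>/syscall` generally needs the same privileges as ptrace, so this would typically be run as root or as the process owner.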

However, in the post-io_uring world, I think I won't see that, right? If I understand right, I'll at most see a call to io_uring_enter, and maybe not even that.

How do I tell what a stuck io_uring-using program is stuck on? Is there a way I can see all the pending I/Os and what's going on with them?

How is this implemented internally - does it expand into one kernel thread per I/O, or something? (I guess, if you had a silly filesystem which spent 5 seconds in TASK_UNINTERRUPTIBLE on each read, and you used io_uring to submit 100 reads from it, what actually happens?)

6 comments

Matthias247 2232 days ago

I think that's a very reasonable concern. However, it isn't really specific to io_uring; it applies to all "async" solutions. Even today, if you are running async I/O in userspace (e.g. using epoll), it's not very obvious where something went wrong, because no task is visibly blocked. If you attach a debugger, you will most likely see the process blocked in epoll, but a call stack leading to the problematic application code is nowhere in sight.

Even if you pause execution while inside the application code, there might not be a useful stack that contains all the relevant data. It will only contain the information since the last task resumption (e.g. through a callback). Depending on your solution (C callbacks, C++ closures, C# or Kotlin async/await, Rust async/await), the information will range from not very helpful to somewhat understandable, but it will never be on par with a synchronous call.


WGH_ 2231 days ago

> Even today if you are running async IO in userspace (e.g. using epoll), it's not very obvious where something went wrong, because no task is seemingly blocked.

It doesn't apply to file I/O, which is never non-blocking and can't be made async with epoll. Readiness APIs treat regular files as always ready for any I/O (poll and select report them as ready, and epoll refuses to register them at all). And if the device is slow, the thread blocks in the dreaded "D" state.
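A quick way to see this for yourself, as a rough sketch: `regular_file_readiness` is a made-up name, and the behavior described is what I'd expect on Linux (select reports a regular file as readable immediately, and registering it with epoll fails with EPERM). Other platforms may differ, so the epoll part is skipped where `select.epoll` doesn't exist.

```python
import select
import tempfile

def regular_file_readiness():
    """Return (select_says_readable, epoll_result) for a regular file.

    epoll_result is "EPERM" if registration was refused, "registered" if it
    was accepted, or "unavailable" on platforms without epoll.
    """
    with tempfile.TemporaryFile() as f:
        f.write(b"x")
        f.flush()
        fd = f.fileno()

        # Zero timeout: select returns at once, and a regular file is
        # reported readable regardless of how slow the device is.
        readable, _, _ = select.select([fd], [], [], 0)
        select_ready = fd in readable

        epoll_result = "unavailable"
        if hasattr(select, "epoll"):
            ep = select.epoll()
            try:
                ep.register(fd, select.EPOLLIN)
                epoll_result = "registered"
            except PermissionError:
                # epoll_ctl() rejects files that do not support polling.
                epoll_result = "EPERM"
            finally:
                ep.close()
        return select_ready, epoll_result
```

Either way, the readiness API tells you nothing about whether the actual read will block.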


CodesInChaos 2230 days ago

The fundamental problem is that readiness-based async I/O and random access do not mix well. You'd need a way to poll readiness for different positions in the same file at the same time.

Completion based async (including io_uring on Linux or IO completion ports on Windows) doesn't suffer from this problem.
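To make the difference concrete, here is a small emulation of the completion model. To be clear, this is not io_uring or IOCP: it is a plain thread pool around `os.pread`, and `read_ranges` and `demo` are names invented for this sketch. The point is only that each request carries its own offset and the caller is told when that specific request is done, so several positions in one file can be in flight at once.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed

def read_ranges(path, requests, workers=4):
    """Submit (offset, length) reads and collect results as they complete.

    Returns a dict mapping each (offset, length) request to the bytes read.
    """
    results = {}
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # "Submission": every request names its own file position.
            futures = {
                pool.submit(os.pread, fd, length, offset): (offset, length)
                for offset, length in requests
            }
            # "Completion": results arrive in whatever order they finish.
            for fut in as_completed(futures):
                results[futures[fut]] = fut.result()
    finally:
        os.close(fd)
    return results

def demo():
    """Write a small temp file and read two ranges from it concurrently."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"0123456789")
        path = f.name
    try:
        return read_ranges(path, [(0, 3), (5, 2)])
    finally:
        os.unlink(path)
```

A readiness API has no equivalent of "offset 5 is ready but offset 0 is not", which is why the thread pool (or a real completion interface) is needed.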


Doxin 2232 days ago

> It will only contain the information since the last task resumption

That's an implementation detail, though. As far as I'm aware, Python keeps hold of the stack, so it outputs complete stack traces, as you'd expect from synchronous code.
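A small experiment along these lines, as a sketch with made-up function names (`inner`, `outer`, `capture_traceback`). It covers only the case where coroutines are awaited directly; work started with `asyncio.create_task` would show just its own await chain.

```python
import asyncio
import traceback

async def inner():
    await asyncio.sleep(0)  # suspend once so the failure happens after a resumption
    raise RuntimeError("simulated I/O failure")

async def outer():
    await inner()

def capture_traceback():
    """Run the coroutine chain and return the formatted traceback text."""
    try:
        asyncio.run(outer())
    except RuntimeError:
        return traceback.format_exc()
    return ""
```

The returned text lists frames for both `outer` and `inner`, even though the exception is raised after the task was suspended and resumed.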


cyphar 2232 days ago

You would want to start using the more modern debugging tools, namely dynamic tracing tools like bpftrace[1]. Though in fairness, it might be a tad tricky to get a trace for a specific file without some more complicated scripts.

[1]: https://github.com/iovisor/bpftrace


shuss 2232 days ago

This is such a great point. Never thought how async I/O could be a problem this way. In the SQ polling example, I used BPF to "prove" that the process does not make system calls:

https://unixism.net/loti/tutorial/sq_poll.html

Could be a good idea to use BPF to expose what io_uring is doing. Just a wild thought.


matheusmoreira 2232 days ago

Good point. Would be great if the submission and completion ring buffers were accessible via procfs.
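Something in this direction may already be partly possible: as far as I know, an io_uring instance shows up in `/proc/<pid>/fd/` as a link to `anon_inode:[io_uring]`, and some kernels print ring details in the matching `/proc/<pid>/fdinfo/<fd>` file. Both are assumptions on my part, and what fdinfo contains (if anything useful) varies by kernel version. A rough sketch with hypothetical helper names:

```python
import os

# Assumption: this is the symlink target the kernel uses for io_uring fds.
IO_URING_LINK = "anon_inode:[io_uring]"

def find_io_uring_fds(pid):
    """Return a sorted list of fd numbers in `pid` that look like io_uring instances."""
    fds = []
    fd_dir = f"/proc/{pid}/fd"
    try:
        names = os.listdir(fd_dir)
    except OSError:
        return fds  # no such process, or no permission
    for name in names:
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # fd was closed while we were scanning
        if target == IO_URING_LINK:
            fds.append(int(name))
    return sorted(fds)

def dump_fdinfo(pid, fd):
    """Return the raw text of /proc/<pid>/fdinfo/<fd>; its format is kernel-specific."""
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        return f.read()
```

Usage would be something like calling `dump_fdinfo(pid, fd)` for each fd returned by `find_io_uring_fds(pid)` and reading the output by eye.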


ecnahc515 2232 days ago

Could eBPF be used? I'm really not sure myself.


dirtydroog 2232 days ago

Use timeouts?


geofft 2232 days ago

How exactly? I/O in TASK_UNINTERRUPTIBLE/TASK_KILLABLE cannot be timed out - so part of my question is how io_uring handles that in general.


cyphar 2232 days ago

If it's just blocked, you could probably look at the io_uring kthreads. But as I mentioned in another comment, bpftrace is probably a more useful tool for things like this (and it's useful for general kernel debugging too!).
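For the "just blocked" case, a sketch of how one might list a process's threads that are sitting in "D" state. The helper names are hypothetical, and there is an important assumption: this only scans `/proc/<pid>/task/`, so it finds io_uring workers only on kernels where they are threads of the submitting process. Where they are separate kernel threads, you would have to scan all of `/proc` instead.

```python
import os

def parse_stat_line(line):
    """Return (comm, state) from a /proc/<pid>/task/<tid>/stat line.

    The comm field is wrapped in parentheses and may itself contain spaces
    or parentheses, so split on the first "(" and the last ")".
    """
    start = line.index("(")
    end = line.rindex(")")
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return comm, state

def tasks_in_d_state(pid):
    """Return [(tid, comm)] for threads of `pid` in uninterruptible sleep."""
    blocked = []
    task_dir = f"/proc/{pid}/task"
    try:
        tids = os.listdir(task_dir)
    except OSError:
        return blocked  # no such process, or no permission
    for tid in tids:
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # thread exited mid-scan or the line was malformed
        if state == "D":
            blocked.append((int(tid), comm))
    return blocked
```

From there, `/proc/<pid>/task/<tid>/stack` for each blocked thread should show the same kind of kernel stack the original question was after.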
