by CyberRabbi 1630 days ago
> As for putting things in threads, I would consider it a huge hack to move open/close. Threads are not and will never be mandatory to have great responsiveness.

The POSIX interface was invented for batch processing: long-running, non-interactive jobs. This is why it lacks timing requirements. All well-designed interactive GUI applications do not interact with the file system on their main thread. This is especially true for game display loops. The fundamental problem here is that they are doing unbounded work on a thread that has specific timing requirements (usually 16.6ms per loop). As I’ve said elsewhere, this bug will still manifest itself no matter how fast you make close(); it just depends on how many device files are present on that particular system. It’s a poor design. Well-designed games account for every line of code run in their drawing loop.
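
To put rough numbers on the frame-budget point, here is a small arithmetic sketch. The 60 Hz refresh rate matches the 16.6ms figure above; the per-call close() costs and the helper name `max_closes_per_frame` are illustrative assumptions, not measurements from the system under discussion.

```python
import math

FRAME_BUDGET_MS = 1000.0 / 60.0  # about 16.67 ms per frame at 60 Hz


def max_closes_per_frame(close_cost_ms: float) -> int:
    """Hypothetical helper: how many sequential close() calls of the given
    cost fit in one frame, assuming the frame does nothing else at all."""
    return math.floor(FRAME_BUDGET_MS / close_cost_ms)


# Assumed per-call costs in milliseconds (illustrative, not measured).
for cost_ms in (50.0, 5.0, 0.5):
    fits = max_closes_per_frame(cost_ms)
    print(f"close() at {cost_ms} ms: {fits} call(s) fit in one frame")
```

Under these assumptions, a faster close() only raises the number of device files needed to overrun the frame; it does not bound the work done in the loop.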

> This is absolutely a kernel bug.

I don’t think that is proven unless the original author can chime in. It’s your best guess and opinion that the author intended to not block on synchronize_rcu, but it’s perfectly possible they did indeed intend the code as written. synchronize_rcu is used in plenty of other critical system call paths in similar ways; not every one of those uses is a bug. I would guess you might be suffering from a bit of tunnel vision here given how the behavior was discovered.

If it is indeed the case that synchronize_rcu is taking up to 50ms, I would suspect there is a deeper issue at play on this machine. By search/replacing the call with call_rcu or similar you may just be masking the problem. RCU updates should not be taking that long.

> All well-designed interactive GUI applications do not interact with the file system on their main thread

I strongly disagree. A well-designed interactive GUI application can absolutely interact with the filesystem on its main thread without any impact to responsiveness what-so-ever. You only need threads once you need more CPU time.

The POSIX interfaces provide sufficient non-blocking functionality for this to be true, and the (as per the documentation, "brief") blocking allowed by things like open/close is not an issue.

(io_uring is still a nice improvement though.)

> I don’t think that is proven unless the original author can chime in.

This argument is nonsense. Whether or not code is buggy does not depend on whether or not the author comments on the matter. This is especially true for a project as vast as the Linux kernel with its massive number of ever-changing authors.

> If it is indeed the case the synchronize_rcu is taking up to 50ms I would suspect there is a deeper issue at play on this machine. By search/replacing the call with call_rcu or similar you may just be masking the problem. RCU updates should not be taking that long.

synchronize_rcu is designed to block for a significant amount of time, but I did not push the patch further precisely because I would like to dig deeper into the issue rather than make a textbook RCU fix.

> A well-designed interactive GUI application can absolutely interact with the filesystem on its main thread without any impact to responsiveness what-so-ever. You only need threads once you need more CPU time.

The "well-designed" argument here is a bit No True Scotsman, and absolutely not true. Consider a lagging NFS mount. Or old hard drives; a disk seek could take milliseconds!

Real-time computing isn't about what is normal or average, it's about the worst case. Filesystem IO can block, therefore you must assume it will.
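
A minimal sketch of why the average is the wrong number to look at. The latency samples are made up for illustration (99 fast calls and one slow outlier standing in for a disk seek or a stalled NFS request), and the 60 Hz budget is an assumption.

```python
from statistics import mean

FRAME_BUDGET_MS = 1000.0 / 60.0  # assumed 60 Hz frame budget

# Hypothetical per-frame I/O latencies in milliseconds.
samples_ms = [0.2] * 99 + [250.0]

average_ms = mean(samples_ms)
worst_ms = max(samples_ms)
missed_frames = sum(1 for s in samples_ms if s > FRAME_BUDGET_MS)

print(f"average {average_ms:.2f} ms, worst {worst_ms:.2f} ms, "
      f"frames over budget: {missed_frames}")
```

With these made-up numbers the average sits comfortably inside the budget while the single outlier overruns it many times over, which is the case a real-time design has to plan for.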

> The "well-designed" argument here is a bit No True Scotsman, and absolutely not true.

This counterargument can be interpreted as a mere No True Scotsman of "responsiveness", so this is not a very productive line of argument.

Should one be interested in having a discussion like this again, I would suggest strictly establishing what "responsive" means (which is a subjective experience), including defining when a "responsive" application may be "unresponsive" (swapping to disk, no CPU/GPU time, the cat ate the RAM), and avoiding terms like "well-designed" (I included it in protest of its use in the comment I responded to).

For example, failing to process input or skipping frames in gameplay would be bad, but no one would see a skipped frame in a config menu, and frames cannot even be skipped if there are no frames to be rendered.

> Should one be interested in having a discussion like this again, I would suggest strictly establishing what "responsive" means (which is a subjective experience)

This has been established for years. This is the basis of building real-time systems. For example, flight control systems absolutely must be responsive, no exceptions. What does that mean? That the system is guaranteed to respond to an input within a maximum time limit. POSIX applications may generally give the appearance of being responsive but absolutely are not unless specially configured. There is no upper bound on how long any operation will take to complete. This will be apparent the minute your entire system starts to choke because of a misbehaving application. Responsive systems have a hard bound on worst-case behavior.

> A well-designed interactive GUI application can absolutely interact with the filesystem on its main thread without any impact to responsiveness what-so-ever. You only need threads once you need more CPU time.

Hmm. If you call open()/read()/close() on the main thread and it causes a high-latency network operation because that user happens to have their home directory on a network file system like NFS or SMB, your application will appear to hang. When you design applications you can’t just assume your users have the same setup as you.
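
One common way to keep the main thread from hanging is to hand the blocking calls to a worker and poll for the result once per frame. The sketch below assumes a single worker thread and a 1 ms sleep standing in for rendering a frame; `read_file` and `run_loop` are hypothetical names, not taken from any application discussed here.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def read_file(path: str) -> bytes:
    """Blocking open()/read()/close(); may stall on a slow or remote mount."""
    with open(path, "rb") as f:
        return f.read()


def run_loop(path: str, max_frames: int = 1000) -> Optional[bytes]:
    """Hypothetical main loop: submit the read once, then poll every frame."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(read_file, path)
        for _ in range(max_frames):
            if future.done():  # cheap check that never waits on the I/O
                return future.result()
            time.sleep(0.001)  # stand-in for rendering one frame
        return None  # gave up; the read is still pending on the worker
    finally:
        pool.shutdown(wait=False)  # do not make the main thread wait here


# Usage with a small temporary file (local disk, so it completes quickly).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"volume=11\n")
try:
    data = run_loop(tmp.name)
finally:
    os.unlink(tmp.name)
print(data)
```

If the mount stalls, the loop keeps running frames and the stall stays on the worker thread. The cost is that the caller has to handle the "not ready yet" case explicitly.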

> The POSIX interfaces provide sufficient non-blocking functionality for this to be true

POSIX file system I/O is always blocking, even with O_NONBLOCK. You can use something like io_uring to do non-blocking file system I/O, but that would no longer be POSIX.
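
The O_NONBLOCK behavior is easy to observe. This is a small sketch assuming a Unix-like system (checked against Linux semantics only): on a regular file the flag is accepted but read() still returns data instead of failing with EAGAIN, whereas on an empty pipe the same flag makes read() fail immediately.

```python
import os
import tempfile

# Regular file: O_NONBLOCK is accepted, but read() still returns the data
# (waiting on the disk if it has to) rather than failing with EAGAIN.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
fd = os.open(tmp.name, os.O_RDONLY | os.O_NONBLOCK)
try:
    regular_result = os.read(fd, 5)
finally:
    os.close(fd)
    os.unlink(tmp.name)

# Pipe: with non-blocking mode set, reading an empty pipe fails right away.
r, w = os.pipe()
os.set_blocking(r, False)
try:
    os.read(r, 5)
    pipe_would_block = False
except BlockingIOError:  # EAGAIN / EWOULDBLOCK
    pipe_would_block = True
finally:
    os.close(r)
    os.close(w)

print(regular_result, pipe_would_block)
```

The temporary file here is tiny and almost certainly served from the page cache, so the sketch only shows that the flag does not change the call's semantics; it does not show how long a cold read can take.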

> Whether or not code is buggy does not depend on whether or not the author comments on the matter.

That would depend on if you knew more about how the code is intended to work than the original author of the code. Do you presume to know more about how this code is intended to work than the original author?

> That would depend on if you knew more about how the code is intended to work than the original author of the code. Do you presume to know more about how this code is intended to work than the original author?

I am not sure if you are suggesting that only the author can know how code is supposed to work, that finding bugs requires understanding of the code strictly superior to the author's, or that the author is infallible and intended every behavior of the current operation.

Either way, this attitude would not have made for a healthy open source contribution environment.

> that finding bugs require understanding of the code strictly superior to the author,

Evaluating whether or not something is a bug in a specific part of a system absolutely requires an understanding of the code’s intent equal to the author’s. You have found undesirable application-level behavior and have attributed the cause to a specific line of code in the kernel, but it’s possible you are missing the bigger picture of how everything is intended to work. Just because latency has been tracked down to that line of code does not mean the root source of that latency is that line of code. Symptoms vs root causes.