by nneonneo 1281 days ago
It sounds like the pause/unpause might be the way to fix this properly, since trying to be heuristically smarter sounds like a recipe for never-ending corner case bugs like the OP’s issue.

The patch for pausing and unpausing seems quite reasonable, except that it does require driver support (unsurprising - you’re literally reallocating the resources used by the driver!). I suppose if you had at least a few movable devices then you should be ok in the event of a hotplug event, so you’d have to hope that enough drivers bother to support the feature.

I wonder what is necessary to get people to care about the patch enough to fix it up and mainline it? I suppose the problem it fixes is still niche enough that not so many people are clamoring for the fix.


The PCI resource allocation code is fairly intricate and everyone is scared that changing it may cause regressions. Sergei's patch set is quite intrusive and it would be necessary to somehow break it up into smaller pieces that are slowly fed into mainline over several release cycles, always watching out for regression reports. So, the problem is known, but the engineers working on PCI code in the kernel are given higher priority stuff to work on by their employers, hence the issue hasn't gotten the attention it deserves.

Actually I forgot to mention there's another solution: A PCIe feature called Flattening Portal Bridge (PCIe Base Spec r6.0 section 6.26). That was introduced with PCIe 5.0. It's more likely that FPB support is added in mainline than the pause/unpause feature. It's supported by recent Thunderbolt chips and it's an official feature of the PCIe standard, so companies will prefer dedicating resources to it rather than some non-standard approach.

For dynamic use cases, the PCIe spec is rather weak on address space handling; in theory, FPB fixes that.

I guess that is unfortunate for those niche hardware use cases.

Isn't FPB in PCIe 4.0? (I am not a PCI-SIG member, so I cannot read the specs.)

I mean, I know about PCIe addressing (from the web, Linux code, and a book I read years ago), but I cannot read the current FPB specification itself.

Would a workaround be for the kernel, whenever it detects this happening (and it did: it printed a message to dmesg), to increment some internal counter so that more resources are reserved on the next reboot?

This would require the kernel being able to either update its own command line somehow, or having some permanent storage somewhere it could store it.

Or this could all be done by systemd - detect that message, increase the resource, next reboot will fix it.

Kernel state does not survive reboots afaik.

That would need help from userland, which is not involved in the early boot process.

You could, I guess, change the kernel command-line parameters and save them in your boot loader configuration, but that is very hackish.
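For what it's worth, the userland side of that hack could look roughly like the sketch below. It scans log text for an allocation failure and, if it finds one, produces a command line with a larger `pci=hpmemsize=` reservation. The failure pattern, the 64M starting value, and the function names are my own assumptions for illustration (the exact kernel message wording varies by version), and the sketch only handles sizes written with an `M` suffix. It does not touch the boot loader at all; persisting the result is the hackish part.

```python
import re

# ASSUMPTION: placeholder pattern for a resource allocation failure line.
# The real wording differs between kernel versions; adjust to what your
# dmesg actually prints.
FAILURE_PATTERN = re.compile(r"BAR \d+: (?:no space for|failed to assign)")

# Only sizes given in megabytes (e.g. hpmemsize=128M) are handled here.
HPMEMSIZE = re.compile(r"hpmemsize=(\d+)M")


def allocation_failed(log_text: str) -> bool:
    """Return True if any log line matches the assumed failure pattern."""
    return any(FAILURE_PATTERN.search(line) for line in log_text.splitlines())


def bump_hpmemsize(cmdline: str, start_mb: int = 64) -> str:
    """Return a command line with the hotplug memory reservation raised.

    An existing hpmemsize value is doubled; if there is none, one is added
    with start_mb (a hypothetical starting value, not a kernel default).
    """
    tokens = cmdline.split()
    for i, token in enumerate(tokens):
        if not token.startswith("pci="):
            continue
        match = HPMEMSIZE.search(token)
        if match:
            doubled = int(match.group(1)) * 2
            tokens[i] = token[:match.start(1)] + str(doubled) + token[match.end(1):]
        else:
            # A pci= option exists but carries no hpmemsize yet.
            tokens[i] = f"{token},hpmemsize={start_mb}M"
        return " ".join(tokens)
    # No pci= option at all: append a new one.
    tokens.append(f"pci=hpmemsize={start_mb}M")
    return " ".join(tokens)


def next_cmdline(log_text: str, cmdline: str) -> str:
    """Leave the command line alone unless the log shows a failure."""
    return bump_hpmemsize(cmdline) if allocation_failed(log_text) else cmdline
```

Something would still have to run this after boot, write the new command line into the boot loader configuration, and decide when to stop doubling, which is why it stays a workaround rather than a fix.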

Maybe it can be introduced gradually, making the reallocation an optional feature that a driver might support. Then drivers can independently implement the resource reallocation feature.

Mainline drivers can move gradually. If they want to be nice for out-of-tree drivers then they can describe a timeline for deprecating and removing the support for non-reallocating drivers.
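To make the opt-in idea concrete, here is a toy model in Python rather than kernel C. It is not the interface from the patch set; `Driver`, `pause`, `resume`, and `reallocate` are hypothetical names. A driver opts in by providing both hooks, and the core only moves the resources of devices whose drivers did so, leaving the rest pinned where they are.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Driver:
    name: str
    # Hypothetical hooks; leaving them as None means "has not opted in".
    pause: Optional[Callable[[], None]] = None
    resume: Optional[Callable[[], None]] = None

    @property
    def movable(self) -> bool:
        """A driver is movable only if it provides both hooks."""
        return self.pause is not None and self.resume is not None


def reallocate(
    drivers: List[Driver], move: Callable[[List[str]], None]
) -> Tuple[List[str], List[str]]:
    """Pause opted-in drivers, move their resources, then resume them.

    Returns (moved, pinned) device names. Pinned devices keep their
    current resources because their drivers cannot be paused.
    """
    movable = [d for d in drivers if d.movable]
    pinned = [d.name for d in drivers if not d.movable]
    for d in movable:
        d.pause()
    try:
        move([d.name for d in movable])
    finally:
        # Resume in reverse order, even if the move itself failed.
        for d in reversed(movable):
            d.resume()
    return [d.name for d in movable], pinned
```

The more drivers provide the hooks, the more room the core has to shuffle things around on a hotplug event, and a driver that never opts in keeps working exactly as before.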

What is the point in having all of the drivers be open sourced and mainlined if we're not willing to fix them to support this?

> What is the point in having all of the drivers be open sourced and mainlined if we're not willing to fix them to support this?

With open source and mainlined drivers, it's very difficult to change all the drivers and ensure they work.

Without open source and mainlined drivers, it becomes impossible.

Possibly it is hard, tedious, or the people able to fix it don’t think it is worth the effort.

Open source projects rely mostly on volunteers, so there is no outside force to appeal to. If nobody volunteers a solution, then it isn’t important enough to solve. The point is that, if it were important enough to fix, anybody with the requisite skills could do so.