by adrian_b 972 days ago
Because the Linux kernel does not have a stable API for modules.

The modules included in the kernel are updated whenever the API changes.

External modules may break with every new kernel release, and working out how to update them takes a lot of effort unless one watches the kernel mailing lists every day, because there is also no documentation on how to migrate old modules to the new kernel.

The cause of the breakage may be just the movement of some definitions from one header to another, which breaks compilation, but most frequently some structures gain or lose members, or some functions gain or lose arguments.

When there are new members or arguments, it is hard to discover what values should be put in them; when members or arguments are deleted, it is hard to discover whether their absence must be compensated for somehow, e.g. by inserting calls to other functions.
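Out-of-tree modules usually cope with this kind of change through version-conditional wrappers. Here is a minimal userspace sketch of that pattern, not real kernel code: `frob_register`, its `flags` argument, `TARGET_KVER` and the 6.2 cutoff are all hypothetical. Only the version encoding follows the kernel's `KERNEL_VERSION()` macro, and in a real module the comparison would be against `LINUX_VERSION_CODE` from `<linux/version.h>`.

```c
#include <stddef.h>

/* Same packing as the kernel's KERNEL_VERSION() macro (for patch levels < 256). */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical: the kernel version we pretend to build against. */
#ifndef TARGET_KVER
#define TARGET_KVER KVER(6, 5, 0)
#endif

/* Hypothetical kernel function that gained a "flags" argument in 6.2. */
#if TARGET_KVER >= KVER(6, 2, 0)
static int frob_register(const char *name, unsigned int flags)
{
    return (name != NULL && flags == 0) ? 0 : -1;
}
#else
static int frob_register(const char *name)
{
    return (name != NULL) ? 0 : -1;
}
#endif

/* Compat wrapper: the rest of the module calls only this function,
 * so the version check lives in exactly one place. */
static int compat_frob_register(const char *name)
{
#if TARGET_KVER >= KVER(6, 2, 0)
    /* Assumption: flags == 0 reproduces the old behaviour. Finding out
     * whether that is true is the hard part described above. */
    return frob_register(name, 0);
#else
    return frob_register(name);
#endif
}
```

The wrapper itself is mechanical. The real cost is knowing which value to pass for the new argument, which the compiler cannot check.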

In the worst case everything can be solved by reading the kernel source, but that takes a lot of time, and those who maintain out-of-kernel modules usually do not do this as a full-time job, so they do not have time to scan the kernel mailing lists every day to see whether anyone plans changes that will break their modules.


A knock-on effect of this is that it's easiest to get the drivers into the mainline kernel, where you have to license your code as GPL. I'm not sure that that's intentional, but it does help make source code available for more drivers. A stable ABI would likely see many abandonware driver blobs.
That's 100% intentional. In any case, non-GPL modules cannot be upstreamed (except with a license that is compatible with the GPL, e.g. BSD is okay), and cannot touch a lot of parts of the kernel.

Also, loading a non-GPL module "taints" your kernel (this cannot be undone except through reboot) and tells everyone on the support mailing lists that you've loaded proprietary code.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux...

And there is much hating on Nvidia, for using a "workaround". Including, famously:

https://www.youtube.com/watch?v=tQIdxbWhHSM

Also famously, this is the reason ZFS is not in the Linux kernel.

> Also, loading a non-GPL module "taints" your kernel (this cannot be undone except through reboot) and tells everyone on the support mailinglists that you've loaded proprietary code.

Except that the Linux kernel falsely taints itself with the "proprietary" bit when a module with a non-proprietary, non-GPL compatible license is loaded.

But of course, not being GPL compatible does not necessarily imply that it's proprietary, e.g. the CDDL (as used by ZFS) is a non-GPL compatible, free software license (according to the FSF).

And furthermore, this is documented and intentional.
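A rough userspace model of why a CDDL module ends up flagged: the check is an allowlist of license strings, and anything not on it is treated the same way. This is a sketch, not the kernel's code. The names `demo_license_ok`, `demo_load_module` and `DEMO_TAINT_PROPRIETARY` are made up, and the string list is written from memory of the kernel's license check, so treat it as an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's "proprietary module" taint bit. */
#define DEMO_TAINT_PROPRIETARY (1u << 0)

/* Allowlist check: only these exact strings count as GPL-compatible.
 * The list is an assumption, recalled from the kernel source. */
static bool demo_license_ok(const char *license)
{
    static const char *const ok[] = {
        "GPL", "GPL v2", "GPL and additional rights",
        "Dual BSD/GPL", "Dual MIT/GPL", "Dual MPL/GPL",
    };
    for (size_t i = 0; i < sizeof(ok) / sizeof(ok[0]); i++)
        if (strcmp(license, ok[i]) == 0)
            return true;
    return false;
}

/* Returns the new taint mask after "loading" a module. The bit is only
 * ever set, never cleared, which models "cannot be undone except
 * through reboot". */
static unsigned int demo_load_module(unsigned int taint, const char *license)
{
    if (!demo_license_ok(license))
        taint |= DEMO_TAINT_PROPRIETARY;
    return taint;
}
```

In this model `demo_load_module(0, "CDDL")` sets the same bit as a closed-source license string would, because the check has only two outcomes.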

> cannot touch a lot of parts of the kernel

Ironically, this is a form of DRM.

Not really, as a one-line change to the source code can easily change this behavior. In short, you can run proprietary modules easily, on all hardware the real kernel supports. Try it. No harder than a recompile.

You just can't distribute kernels with proprietary modules outside your organisation, and you have to upgrade and support the modules yourself. Which, I'd bet, effectively means Google, Meta, Samsung & Oracle and other behemoths can and do use proprietary modules.

The main complaint is that, say, any kind of small hardware vendor effectively cannot have proprietary modules in the kernel.

And this is generally regarded as a very good thing, a nice way to deal with a very real threat to open source.

How does this manage to scale? With even a modest number of drivers, you'd think the maintenance burden would be unreasonably high in terms of needing to have someone available who understands how a set of drivers work, knows how to test them, and keeps them up-to-date.
That is why everybody attempts to push their modules into the kernel source.

Those who cannot do that because their modules are not open source, like NVIDIA, have a lot of work to do for module maintenance.

Nevertheless, there are also open-source modules that cannot be pushed into the kernel because they are for uncommon hardware with few users, and those also require a lot of work for maintenance, so they are frequently abandoned and no longer brought up-to-date, to be compatible with the latest kernels.

> That is why everybody attempts to push their modules into the kernel source.

> Those who cannot do that because their modules are not open source, like NVIDIA, have a lot of work to do for module maintenance.

Say someone gets a driver for a WiFi USB stick into the kernel source and leaves it for others to maintain. Wouldn't the maintainers need the hardware device for testing changes and an understanding of the hardware's quirks to keep the driver updated? Obviously the system works but I don't get how it doesn't suffer from software rot, especially when physical hardware is required for robust testing including testing the hardware against different motherboards.

What I've seen is (more or less) the mainline kernel maintainers for whichever kernel have the objective of making sure the drivers compile.

They can't test what they don't have, and I haven't seen evidence that any open source kernel has a testing team with a large fleet of hardware for tests. There's more or less an incidental distributed test fleet, as users have various testing setups, etc.; but driver mistakes in obscure hardware tend to get found late.

It's not unusual for drivers to be found to have been broken for years, and then unless the reporter can fix it, or the fix is obvious from the report, the driver gets removed.

Another path towards driver removal is some interfaces get updated in multiple steps: a new interface is added, drivers are updated to the new interface, the old interface and any drivers that haven't been updated are removed. If nobody has stepped up to update the driver before removal, that's an indication of disuse; sometimes they get added back after removal though.
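The transitional step of such a migration can be sketched like this, assuming a hypothetical ops table (`demo_ops`, `read_old`, `read_new` and `core_read` are invented names, not a real kernel interface). While both callbacks exist, the core prefers the new one and falls back to the old one; the final step deletes `read_old` along with every driver that still depends on it.

```c
#include <stddef.h>

/* Hypothetical driver ops table during a staged migration. */
struct demo_ops {
    int (*read_old)(int reg);                     /* legacy callback, to be removed */
    int (*read_new)(int reg, unsigned int flags); /* replacement callback */
};

/* Core code during the transition: prefer the new callback,
 * fall back to the legacy one, fail if the driver has neither. */
static int core_read(const struct demo_ops *ops, int reg)
{
    if (ops->read_new)
        return ops->read_new(reg, 0);
    if (ops->read_old)
        return ops->read_old(reg);
    return -1;
}

/* A driver nobody has converted yet. */
static int legacy_read(int reg)
{
    return reg + 1;
}

/* A driver already converted to the new interface. */
static int modern_read(int reg, unsigned int flags)
{
    (void)flags; /* unused in this sketch */
    return reg + 2;
}
```

Once `read_old` is deleted from the struct, an unconverted driver like the `legacy_read` one no longer compiles, which is the point at which it either gets updated or removed.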

For the record, Windows isn't necessarily better. I had a printer where there was a working driver, but Windows would prefer the non-working driver from Windows Update, and it was impossible to get that fixed.

You are correct. The answer is that it does suffer from software rot.
> How does this manage to scale?

"Okay" to "very poorly", depending on your view. And for the record, many times the drivers go untested, frankly. I've absolutely found completely broken drivers in the kernel (as in just obvious lock imbalances where unlock() can't possibly ever get called, null pointer derefs; simple things you can spot in a code review) in the kernel where the break was introduced as some part of a refactoring and the person doing the work made a mistake. Because they also refactored 30 other drivers in the same patch series to match an API change. (The filesystem developers have said many times in patch reviews that certain things are massive chores because they have to go fix 40+ filesystem drivers every time some API break happens.)

Linux actually has tons of hardware regressions because of this whole design choice. When you totally rewrite a piece of code that is shared among multiple components, that can be OK, as long as you preserve the external behavior that the previous interface exposed. An easy way to do that is to establish a contract between the implementation and call sites, which is often reflected as a stable API. The stable API makes some contracts explicit, by construction. But many times API refactorings will happen at the same time, and that in practice typically introduces behaviors into the downstream code that didn't exist before. These new behaviors, when introduced into an existing driver that was not developed for them, tend to cause bugs when not tested fully.

A big reason Linux chooses not to have stable internal APIs is for agility, more or less. But nothing is free and this is the price that is paid as the project grows.

My personal poster child for this stuff is the amdgpu driver. I use a Navi workstation card (WX5500) in my server and whether or not the amdgpu driver functions correctly on updates is a crap shoot. A while back it went completely headless; when I upgraded to 6.5 about 2 weeks ago and had to attach a monitor to look at the kernel logs (network config snafu), my dmesg had 7 kernel faults in its log from amdgpu. Seven! For a 3.5 year old card! With no desktop environment! Despite the fact that the card isn't changing, the driver is changing: new hardware support, expanded interfaces, new features, and those cause regressions. Many subsystems get overhauls and behavioral changes every release, so these things are bound to happen. Testing is not uniform; many parts of the kernel are far better tested than others. Peripheral drivers are another example of easily broken code (Xilinx code upstream is frankly broken garbage half the time).

The Linux Kernel is a pretty amazing project in many respects but I'm frankly astonished half the time it actually boots to a working desktop successfully.

My company maintains an out-of-tree kernel module for changed block tracking. It requires a full-time Linux driver developer to keep it compiling and working on new kernels.

Our module is GPL FWIW but I doubt it would be accepted into the kernel tree as it is. We need to have a driver developer on staff anyway to support it so it works out okay for us. But if we ever stop maintaining it, it will bitrot quickly and stop working on new kernels.

Coccinelle semantic patching is pretty cool for making systematic changes across the kernel (eg. adding a parameter to the same function everywhere).

https://coccinelle.gitlabpages.inria.fr/website/

Maybe it's untenable, counterproductive or simply a bad idea :-)

But has there ever been any attempt to create an abstraction layer in the kernel that does provide a stable ABI? Something that could be used for certain classes of driver?

I don't know the exact rule, but "abstraction layers" for a driver are disallowed. I believe it was AMD or ATI that added a new driver for a device, but they basically just took the code from their proprietary driver, and then put an "abstraction layer" in the middle. That did not make the kernel overlords happy and that version of the code was never merged.
Isn’t that effectively what NVidia does anyway? Some minimal shim that lets their blob do all of the work?
Are abstraction layers still disallowed if they are only able to load GPL code?
> Are abstraction layers still disallowed if they are only able to load GPL code?

How would you possibly implement this as a technical restriction?