by theparanoid 3484 days ago
Nvidia uses largely the same driver code for both linux and windows in their proprietary driver (I believe they call it a unified driver).

AMD tried the same in their open source driver and were rejected by the kernel maintainer. Unified drivers have code sharing advantages but don't follow the practices of the linux kernel.


My naive assumption is that code reuse across platforms is a good thing, I'd love to understand why this isn't the case here or what the concrete arguments are against it.
> My naive assumption is that code reuse across platforms is a good thing, I'd love to understand why this isn't the case here or what the concrete arguments are against it.

A driver is inherently platform-specific. It's glue that ties the hardware to the operating system. The only "correct" way to have one driver work on multiple operating systems is for the operating systems to all use the same driver model.

The ugly way is to create your own hardware abstraction layer and then write a translation layer between that and each operating system, which is complicated and hideous.

But it's especially silly because Linux accepts suitable contributed code, so you could instead use the native Linux model as your "intermediary layer" and fix Linux if it isn't suitable in some way. And then translate that to what the closed operating system you can't modify uses.

The result is that the Linux people are happier and you have one less translation layer to maintain.
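To make the layer counting concrete, here is a minimal sketch of the two layouts, in Python rather than kernel C. Every class and function name is hypothetical and stands in for no real kernel or vendor interface. In the first layout the hardware logic talks to a vendor HAL, so each OS needs its own translation layer; in the second, the hardware logic targets the (toy) Linux interface directly and only the closed OS needs a shim.

```python
"""Toy model of the two driver layouts; all names are hypothetical."""


class LinuxKernel:
    """Stand-in for the native Linux driver model."""

    def linux_map_buffer(self, size: int) -> str:
        return f"linux buffer of {size} bytes"


class ClosedKernel:
    """Stand-in for an OS whose driver model the vendor cannot change."""

    def closed_alloc(self, byte_count: int) -> str:
        return f"closed-os buffer of {byte_count} bytes"


# Layout 1: vendor HAL plus one translation layer per operating system.
class HalOnLinux:
    def __init__(self, kernel: LinuxKernel) -> None:
        self.kernel = kernel

    def hal_alloc(self, size: int) -> str:
        return self.kernel.linux_map_buffer(size)


class HalOnClosed:
    def __init__(self, kernel: ClosedKernel) -> None:
        self.kernel = kernel

    def hal_alloc(self, size: int) -> str:
        return self.kernel.closed_alloc(size)


def hal_driver_init(hal) -> str:
    """Hardware logic written against the vendor's private HAL."""
    return hal.hal_alloc(4096)


# Layout 2: hardware logic targets the Linux-style interface directly;
# only the closed OS needs a shim that presents the same call.
class LinuxShimOnClosed:
    def __init__(self, kernel: ClosedKernel) -> None:
        self.kernel = kernel

    def linux_map_buffer(self, size: int) -> str:
        return self.kernel.closed_alloc(size)


def native_driver_init(kernel) -> str:
    """Hardware logic written directly against the Linux-style interface."""
    return kernel.linux_map_buffer(4096)


if __name__ == "__main__":
    print(hal_driver_init(HalOnLinux(LinuxKernel())))
    print(hal_driver_init(HalOnClosed(ClosedKernel())))
    print(native_driver_init(LinuxKernel()))
    print(native_driver_init(LinuxShimOnClosed(ClosedKernel())))
```

Layout 1 needs `HalOnLinux` and `HalOnClosed`; layout 2 needs only `LinuxShimOnClosed`, and the Linux side has nothing extra for kernel maintainers to read.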

That might run into license issues. If you want to avoid licensing the other versions of your driver under GPLv2, you'd have to carefully avoid copying any code from the main kernel into your translation layer (rewriting any helper functions you end up using), and even then there's the idea of API copyright to contend with.

One might ask whether it is desirable to avoid the GPL, and there are a lot of arguments on both sides there, but it's certainly easy to run into issues when you have a GPL licensed module designed to be linked into a proprietary program (kernel).

> If you want to avoid licensing the other versions of your driver under GPLv2, you'd have to carefully avoid copying any code from the main kernel into your translation layer (rewriting any helper functions you end up using), and even then there's the idea of API copyright to contend with.

Isn't the point supposed to be to not have other versions of your driver, so you can use the same one on every platform?

By "other versions" I mean the codebase used for a given non-Linux platform, which would (hypothetically) include most of the Linux driver's code plus a translation layer from Linux APIs to that platform's.
The translation layer isn't where the interesting bits are. The parts hardware companies want to keep secret are the hardware-specific parts, not the OS-specific parts. It might even help them to open source the translation layers because then others could potentially use them and shoulder some of the maintenance cost.

I can't speak to the legal status of GPL drivers for Windows, but several seem to exist already (e.g. Windows ext4 driver), and if they were actually worried about it they could always get explicit permission from the copyright holders of the relevant code. Either they say yes and you're fine, or they say no and you know which pieces of code to replace.

>But it's especially silly because Linux accepts suitable contributed code, so you could instead use the native Linux model as your "intermediary layer" and fix Linux if it isn't suitable in some way. And then translate that to what the closed operating system you can't modify uses.

But Linux represents a tiny portion of the gaming community, so that approach would make no sense at all for a GPU vendor. C'mon.

Then they aren't going to get their driver upstream. End of story. Kernel developers have already done this once (Dave hinted at Exynos drivers in the past in his other posts) and it was a large amount of work to un-screw the pooch once all this crap came along.

I know that Linux people really really just want the kernel to take one for the team so they can have GPUs because that's just the goal, and clearly the goal is good and the means don't matter at all and everything else is irrelevant. 100,000 lines of crap code, 200k? 500k? Who cares, it's all in the name of GPUs clearly. It's obviously worth it no matter what.

But the kernel developers do not see it that way, and for good reason -- because once it's in tree, they are all on the hook for it and they all have to deal with the swamp, the added complexity, the maintenance, the un-fucking of this entire HAL, etc etc.

Having worked on a large open source project, I can assure you, it sucks when you have to say "This isn't acceptable and we aren't merging it", even when it's a feature the users want, and one someone worked on for a long time. It is also, almost always, the right thing to do in the long run (and several of those features did come back, in acceptable ways, in our case).

> But Linux represents a tiny portion of the gaming community, so that approach would make no sense at all for a GPU vendor. C'mon.

The growth market for GPUs is GPGPU and servers. And Linux represents a large portion of the programming and server communities.

More to the point, as soon as you support Linux at all then it doesn't matter who has more share, it's still less work to do the above than have to maintain another translation layer.

But AMD doesn't. GPGPU is already supported on nvidia drivers with their opaque blob. AMD has a more-transparent blob. People who want this to work already have a solution. This kernel change is probably important to some people, but those who simply want to run a GPGPU cluster on linux already have workable solutions.
The GPGPU market is the polar opposite of the gaming market.

Game developers might like to see clean driver source but they don't get to choose what kind of GPU their customers have already bought. And 99% of gamers are not going to choose their GPU based on Linux drivers. So nobody has any leverage and vendors have no incentive to change.

Meanwhile thousands of universities and institutions are each going to be looking for 25,000 GPUs and they can choose what brand they buy based on what makes their internal developers happy. Hosts like Amazon and Google are each going to be buying millions of GPUs, and having better and more transparent drivers so they can more easily e.g. improve power consumption by a small percentage, can save them a million dollars/year in electricity.

Someone like Google could come to each vendor and say "first to have mainline kernel drivers gets all our business" at any point. Or the same result in the other order; once there are clean drivers third parties are more likely to make power consumption and performance improvements that give AMD the edge when the major customers crunch the numbers.

There is a significant competitive advantage in it for AMD to get this right.

Very good point. There's definitely a growing market for high-bandwidth GPGPU solutions, and neural networks are probably just the start.
I agree with you almost entirely, except the part about fixing Linux. If the abstraction that Linux provides isn't suitable for some reason, it probably isn't straightforward to change it because of compatibility with existing code.
That's not so much a concern within the kernel boundary, which is the case that applies here. If you have a compelling reason to redesign an internal API, you "just" have to fix up all the code across the tree that consumes it. Changes are regularly made to the internal VFS interfaces, for example.
It's also often the case that kernel-driver interfaces are extended without breaking compatibility. In those cases, you want to ensure that the extensions are suitable for more than one driver to consume.
Or future changes in the linux target need to be translated to all other target wrappers.
The problem isn't that sharing code across platforms is bad, it's that not sharing code within Linux is bad. Airlie is basically saying that if the kernel API and subsystems are somehow inadequate, AMD should improve them directly instead of covering them up with a bunch more code.
> Airlie is basically saying that if the kernel API and subsystems are somehow inadequate, AMD should improve them directly instead of covering them up with a bunch more code.

And you really believe that the maintainers will accept a giant patch that changes the API and subsystem completely (though into something better) and that risks causing lots of regressions in existing drivers? And you believe that AMD is supposed to fix all the regressions this change causes in other vendors' drivers?

Of course not, the maintainers will accept a well thought out series of patches that each make one small logical change towards the better interface.

And yes - who else is supposed to fix all the regressions caused by changes that AMD wants? Volunteers who would rather work on something else? If you want a change, you get to support the regressions - and if AMD's work gets merged, then anyone ELSE who wants to make a change in that page needs to support AMD's regressions.

Hence wanting to make sure that the changes from AMD are manageable and flexible enough to allow further changes.

> If you want a change, you get to support the regressions

And what about a change to a stable internal kernel API, which the kernel developers refuse?

No, they just want a stable future, around which they can plan the API, but so far no one has delivered on that tiny requirement.
The Linux kernel does not have stable internal kernel APIs.

https://www.kernel.org/doc/Documentation/stable_api_nonsense...

(without looking at details) The problem is that Windows and Linux expose hardware and drivers in different ways. You can shim things up to make the code work, but a driver that doesn't look like a Linux driver, doesn't work like a Linux driver, and can't easily be maintained by the people working on the Linux graphics drivers is going to be a problem.

If the driver doesn't really belong in the Linux kernel source for those reasons, it's better to keep it outside the kernel tree.

AIUI, the problem is

code re-use between drivers of different vendors but the same kernel/OS,

VS

code re-use between drivers of the same vendor but different kernels/OSes.

At the end of the day, both sides are arguing for code re-use, of sorts.

The open source developers don't care about invisible code reuse in a closed source driver. HALs across open source codebases do exist too (eg for ZFS) but Linux in particular does not like them.
AMD should move their HAL code into their Windows driver, making it a superset of the Linux driver. AMD would get to reduce driver code duplication and Linux kernel developers wouldn't need to merge AMD's ugly Linux/Windows HAL.
> AMD should move their HAL code into their Windows driver, making it a superset of the Linux driver.

This might theoretically make sense if the Linux subsystem was very stable over many years. In practice, the Windows interfaces are the ones that stay a lot more stable over the years, and changes to them are communicated long in advance so that hardware vendors can start adapting their drivers early.

Regardless of one's thoughts on AMD, I think this is broadly true. Microsoft may do a lot of things poorly, but one thing they are good at (arguably, the only thing they're good at, hell maybe the key to their success, really) is maintaining compatibility and not breaking stuff.
This is explained here: https://www.kernel.org/doc/Documentation/stable_api_nonsense...

Linux maintains compatibility by having the kernel developers fix the drivers themselves when they break them. Microsoft is not supposed to break its interfaces (though in practice it can, and does), since it doesn't control the drivers.

This allows Linux to keep improving without breaking things in production; while Microsoft has to either maintain huge backward compatibility abstractions for changes, go YOLO and break stuff (often unknowingly) or abstain from improving their OS.

It is a good thing. For the developers of that piece of code (AMD in this case).

However, it is introducing a second API for a very specific subset of hardware into a kernel that is being developed by not just AMD people. Dave Airlie is rightly saying that the second API and hence two different code structures makes the whole DRI infrastructure harder to maintain for everyone else.

And Dave's responsibility is to everyone else, not to AMD.

It is a good thing for the driver writer as they have less difference between their targets.

It is a bad thing for the targets as they implement both the driver functionality and the abstractions required to make the same code work cross platform. The response linked describes the cost of those abstractions to the target (Linux kernel in this case).

I believe this link is the "concrete argument" against a unified abstraction layer in this particular instance.
But Nvidia's proprietary driver is a download right? Why is AMD trying to merge theirs into the kernel?
Nvidia's proprietary driver breaks upon every new kernel release, which is why they have a shim. Furthermore, Nvidia can't ship their driver in the official Linux kernel due to copyright issues, and they're forced to handle all the maintenance burden of their driver (whereas AMD reaps the benefits of Intel's GPU driver bugfixing, and vice versa, thus lowering both Intel's and AMD's driver costs on Linux).

Besides, Nvidia's been having trouble with their Tegra GPUs on Android, and as a result have been forced to pitch in a bit on Nouveau (the reverse-engineered open-source Nvidia driver). They're still having trouble with their driver situation on mobile, as a result of their unwillingness to play ball with the kernel.

Actually, that last sentence above - I'm really not too confident on that, I've heard various hearsay but the only source I concretely remember is the "other drivers" section of http://richg42.blogspot.com.au/2014/05/the-truth-on-opengl-d...

> Furthermore, Nvidia can't ship their driver in the official Linux kernel

Nvidia has every ability to ship it, they just refuse to open it.

probably rightfully so, if this is the sort of welcome they'd get.
Opening their driver's code is different from merging it into the kernel source.
There is a huge difference between opening code and merging it into the Linux kernel.
> Nvidia's proprietary driver breaks upon every new kernel release, which is why they have a shim.

ELI5: why does each Linux kernel release break driver code? It can't be THAT hard to just have a stable interface and leave it for long periods of time, e.g. only bumping it on major version bumps in the Kernel?

Because in practice, APIs inside Linux do, in fact, change quite a bit -- and by itself maybe that wouldn't matter so much, but the nvidia driver has an insane amount of surface area on top of it. It's a massive driver. You can imagine then, that breaking it is actually easier than you might think.

There is no rule kernel interfaces can only change on major bumps. In reality, they change quite frequently, as new APIs and drivers are merged in, which requires generalization, refactoring, etc across API boundaries to keep things sane. Kernel developers specifically reject the notion of a "stable ABI" like this because they feel it would tie their hands, and lead them to design APIs and workarounds for things which would otherwise be fundamentally simple if you "just" break some function and its call sites. APIs in Linux tend to organically grow, and die, as they are needed, by this logic.

Why wait 5 years for a "major version bump" to delete an API call when you could just do it today and fix the callers, since they're all right there in the kernel tree? It's far easier and more straightforward to do this than attempting to work around "stable" systems for very long periods of time, which is likely to accumulate cruft.

Because they do not care about out-of-tree code, when an API changes, their obligations are to refactor the code using that API, inside the kernel, and nothing else. That means the person making the change also has to fix all the other drivers, too, even if they don't necessarily maintain them. Out of tree users will have to adapt on their own.
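A toy illustration of that rule, assuming nothing about real kernel interfaces (the Python classes and function names below are all hypothetical): the interface change and the in-tree caller are updated together, while the out-of-tree caller keeps using the deleted call and breaks.

```python
"""Toy model of an internal API change; every name here is hypothetical."""


class KernelV1:
    """The 'kernel' before the change."""

    def submit_frame(self, pixels: bytes) -> int:
        return len(pixels)


class KernelV2:
    """The 'kernel' after the change: the old call is deleted outright."""

    def submit_frame_to_plane(self, plane: int, pixels: bytes) -> int:
        return len(pixels)


def in_tree_driver_v1(kernel, pixels: bytes) -> int:
    return kernel.submit_frame(pixels)


def in_tree_driver_v2(kernel, pixels: bytes) -> int:
    # Updated by whoever changed the kernel interface, in the same series.
    return kernel.submit_frame_to_plane(0, pixels)


def out_of_tree_driver(kernel, pixels: bytes) -> int:
    # Not visible in the tree, so nobody making the change updates it.
    return kernel.submit_frame(pixels)


def works(driver, kernel) -> bool:
    """Return True if the driver can still call into this kernel."""
    try:
        driver(kernel, b"\x00" * 16)
        return True
    except AttributeError:
        return False


if __name__ == "__main__":
    print("in-tree on v2:", works(in_tree_driver_v2, KernelV2()))
    print("out-of-tree on v2:", works(out_of_tree_driver, KernelV2()))
```

The out-of-tree driver is fine against `KernelV1` and fails against `KernelV2`, which is the position an out-of-tree module is in after an internal interface changes.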

This also explains why they do not want a HAL. When a Linux driver interface changes, the person changing it is responsible for changing everything else and fixing other drivers. That means if AMD wants a large change, it may have to go and touch the Intel driver and refactor it to match the new API. If Intel wants something new, they may have to touch the AMD driver in turn. This, in effect, helps reduce the burden and share responsibilities among the affected people.

They don't want a HAL because a HAL is a massive impediment to exactly that workflow. If Intel wants to improve a DRM/DRI interface in the kernel for their GPUs, they could normally do so and touch all the other drivers. Out with the old, in with the new. But now, they'd have to also wade through like 50,000 lines of AMD abstraction code that no other system, no other driver, uses. It effectively makes life worse for every graphics subsystem maintainer when this happens, except for AMD I guess since they can pawn off some of the work. But if AMD plays by the rules -- Intel fixing their AMDGPU driver when they make a change shouldn't be that unusual, or any more difficult compared to any other graphics driver. And likewise -- AMD making a change and having to fix Intel's driver? That's just par for the course.

Obviously Linux isn't perfect here and they do, and have, accepted questionable things in the past, or have rejected seemingly reasonable API changes out of stability fear (while simultaneously not wanting a stable ABI -- which is fair). But the logic is basically something like the above, as to why this is all happening.

AMD's recent strategy has been to try to confine the proprietary stuff to userspace, and to implement an open-source kernel driver that can be used by either the proprietary userspace driver or open-source userspace stack.
It's already part of the kernel, this was a re-architecture of the display portion of the driver.
The amdgpu driver is open source, not proprietary.
Open source != free software.
The difference between open source and free software is mostly in the political camp the term comes from. Read the OSI definition of open source if you don't believe it:

> https://opensource.org/osd-annotated

The reasons why "free software" people don't like the word "open source" are indeed political:

> https://www.gnu.org/philosophy/open-source-misses-the-point....

For software for which the source code is available, but does not give the four freedoms:

> https://www.gnu.org/philosophy/free-sw.html#content

it is common to use the word "shared source" (originally devised by Microsoft):

> https://en.wikipedia.org/wiki/Shared_source

The kernel driver in the proprietary AMDGPU-PRO package is licensed exactly the same way as the modules in the mainline kernel. Most of them are dual-licensed under MIT and GPL so BSD and other projects can use them.
But it is free software. It resides in the kernel tree: https://git.kernel.org/cgit/linux/kernel/git/stable/linux-st...
https://www.phoronix.com/scan.php?page=article&item=amd_cata...

Note that both the amd and the nvidia kernel modules always have been FOSS because of the GPL license. It's just that nvidia provides it by its own ways, not through the official linux branch, and thus doesn't have to respect linux rules nor to document the driver.

> Note that both the amd and the nvidia kernel modules always have been FOSS because of the GPL license.

The only open source part of their modules was the shim, while 99% of the driver is contained in the blob. That's true for both Nvidia and ATI/AMD fglrx.

Maybe a stupid question but why is it a big deal? Who else does this affect other than nvidia and amd?
I think the idea is that the kernel maintainers can't break the code that the AMD driver relies on and in order to properly do that they need to be able to easily grok all the driver implementations, and abstraction layers make that more difficult.
I don't know anything about linux display architecture, but this point sounds really weird to me from general software engineering perspective. Isn't one of the goals of having drivers in an OS to establish a formalized interface between drivers and kernel, and thus achieve separation of concerns between driver maintainers and kernel maintainers? Requiring that kernel maintainers understand how all drivers work does not sound very scalable.
Linux developers want to be able to share code across drivers for similar devices, and they want to be able to refactor and improve that shared code without worrying about out-of-tree drivers. That strategy has worked well for eg. WiFi drivers where the shared mac80211 subsystem allows a lot of logic to be pulled out of individual NIC drivers, and improvements to mac80211 more or less automatically benefit all participating drivers.
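As a rough sketch of why that sharing pays off (Python, with hypothetical names; this is not mac80211's actual API): drivers that delegate to the shared layer pick up a fix made there, while a driver carrying its own private copy of the logic does not.

```python
"""Toy model of a shared subsystem; names are hypothetical, not mac80211's API."""


class SharedMacLayer:
    """Logic common to every participating driver."""

    def __init__(self) -> None:
        self.max_queue = 1000  # deliberately oversized default

    def queue_limit(self) -> int:
        return self.max_queue


class ParticipatingDriver:
    """Delegates queue policy to the shared layer."""

    def __init__(self, shared: SharedMacLayer) -> None:
        self.shared = shared

    def queue_limit(self) -> int:
        return self.shared.queue_limit()


class StandaloneDriver:
    """Carries a private copy of the same policy."""

    def queue_limit(self) -> int:
        return 1000


if __name__ == "__main__":
    shared = SharedMacLayer()
    drivers = [ParticipatingDriver(shared), ParticipatingDriver(shared),
               StandaloneDriver()]
    shared.max_queue = 64  # one fix, made once, in the shared layer
    print([d.queue_limit() for d in drivers])
```

After the single change to `shared.max_queue`, both participating drivers report the new limit and the standalone one still reports its old private value.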
Makes sense, but it doesn't have to be mutually exclusive. It is possible to have a fixed network driver interface, and some common helpers orthogonal to it. This way driver developers could choose whether to benefit from common code or not. I guess if linux developers want to enforce code sharing, this wouldn't work, but I wonder why they would do it. Seems like it just makes life harder for both parties.
Many network drivers in Linux do opt out of the common frameworks, and changes like BQL [0] and some of the recent WiFi restructuring [1] that require driver-level modifications are usually not introduced as a mandatory compatibility break that requires modifying every driver at once.

Sometimes drivers opt out because the hardware is odd enough that it needs to do things differently, or because it offloads to hardware (or proprietary firmware) functionality that is usually done in software. But often, it's just because a driver was written in isolation by the manufacturer and then dumped on the community. (See [2], from the comments in [1], about the work required to clean up some Realtek WiFi drivers enough to be merged to the kernel staging area.) If a driver unnecessarily opts out of common frameworks and does things internally and differently, it can be hard to even evaluate whether problems fixed in the standard frameworks exist in the special snowflake drivers. Even after identifying a problem, the recipes that worked to fix the standard drivers won't apply.

[0] https://lwn.net/Articles/454390/

[1] https://lwn.net/Articles/705884/

[2] https://www.linuxplumbersconf.org/2016/ocw//system/presentat...

Linux very intentionally and explicitly does not guarantee a stable interface for drivers. If you want to maintain your driver out of tree, then that's your right, but you have to be prepared for Linux kernel maintainers to break your driver regularly. The point is they are not interested in sacrificing the ability to improve the kernel just in order to make it easier for people to maintain proprietary drivers.

Proprietary drivers are tolerated, not liked, and people aren't interested in making it easier for them.

It's exactly the opposite with Linux, where a stable kernel API/ABI is avoided as a matter of principle.
Internal API/ABI is broken regularly, but Linus is fairly clear on the subject of breaking userspace - https://lkml.org/lkml/2012/12/23/75
I wouldn't say avoided.. A lot of the interfaces are still compatible with prior versions as is the software that runs on it, or we'd be on Kernel v50+ by now. Not all software, but a bit.

That said, the community isn't afraid of breaking changes to push future versions forward.

>A lot of the interfaces are still compatible with prior versions as is the software that runs on it, or we'd be on Kernel v50+ by now.

Linux doesn't follow SemVer.

That's a stable internal kernel API/ABI that's not provided. The stable external (userspace-facing) ABI is very much maintained.
But the internal APIs are what driver programmers usually care about. So your argument is void.
I agree that this is how it seems it should be. I was mostly basing my comment off of this: https://lists.freedesktop.org/archives/dri-devel/2016-Februa...
Thank you. This comment clarifies things quite a bit for me.
there are many other gpu vendors, e.g. in arm cores. I guess this is the code tree in question: https://github.com/torvalds/linux/tree/master/drivers/gpu/dr...
Intel for one.