Very short - AMD made an abstraction layer, to share effort between Linux and other platforms (i.e. Windows and etc.). Kernel/DRM maintainers don't like that, since it causes several issues detailed in that thread (harder to understand logic of the driver, slowdown of DRM improvement itself, indirect workflow of AMD developers and so on). For the reference, DRM here is Direct Rendering Manager[1], nothing to do with crooked Digital Restrictions Management.
Nvidia uses largely the same driver code for both linux and windows in their proprietary driver (I believe they call it a unified driver).
AMD tried the same in their open source driver and were rejected by the kernel maintainer. Unified drivers have code sharing advantages but don't follow the practices of the linux kernel.
My naive assumption is that code reuse across platforms is a good thing, I'd love to understand why this isn't the case here or what the concrete arguments are against it.
> My naive assumption is that code reuse across platforms is a good thing, I'd love to understand why this isn't the case here or what the concrete arguments are against it.
A driver is inherently platform-specific. It's glue that ties the hardware to the operating system. The only "correct" way to have one driver work on multiple operating systems is for the operating systems to all use the same driver model.
The ugly way is to create your own hardware abstraction layer and then write a translation layer between that and each operating system, because that's complicated and hideous.
But it's especially silly because Linux accepts suitable contributed code, so you could instead use the native Linux model as your "intermediary layer" and fix Linux if it isn't suitable in some way. And then translate that to what the closed operating system you can't modify uses.
The result is that the Linux people are happier and you have one less translation layer to maintain.
That might run into license issues. If you want to avoid licensing the other versions of your driver under GPLv2, you'd have to carefully avoid copying any code from the main kernel into your translation layer (rewriting any helper functions you end up using), and even then there's the idea of API copyright to contend with.
One might ask whether it is desirable to avoid the GPL, and there are a lot of arguments on both sides there, but it's certainly easy to run into issues when you have a GPL licensed module designed to be linked into a proprietary program (kernel).
> If you want to avoid licensing the other versions of your driver under GPLv2, you'd have to carefully avoid copying any code from the main kernel into your translation layer (rewriting any helper functions you end up using), and even then there's the idea of API copyright to contend with.
Isn't the point supposed to be to not have other versions of your driver, so you can use the same one on every platform?
By "other versions" I mean the codebase used for a given non-Linux platform, which would (hypothetically) include most of the Linux driver's code plus a translation layer from Linux APIs to that platform's.
>But it's especially silly because Linux accepts suitable contributed code, so you could instead use the native Linux model as your "intermediary layer" and fix Linux if it isn't suitable in some way. And then translate that to what the closed operating system you can't modify uses.
But Linux repesents a tiny portion of the gaming community, so that approach would make no sense at all for a GPU vendor. C'mon.
Then they aren't going to get their driver upstream. End of story. Kernel developers have already done this once (Dave hinted at Exynos drivers in the past in his other posts) and it was a large amount of work to un-screw the pooch once all this crap came along.
I know that Linux people really really just want the kernel to take one for the team so they can have GPUs because that's just the goal, and clearly the goal is good and the means don't matter at all and everything else is irrelevant. 100,000 lines of crap code, 200k? 500k? Who cares, it's all in the name of GPUs clearly. It's obviously worth it no matter what.
But the kernel developers do not see it that way, and for good reason -- because once it's in tree, they are all on the hook for it and they all have to deal with the swamp, the added complexity, the maintenance, the un-fucking of this entire HAL, etc etc.
Having worked on a large open source project, I can assure you, it sucks when you have to say "This isn't acceptable and we aren't merging it", even when it's a feature the users want, and one someone worked on for a long time. It is also, almost always, the right thing to do in the long run (and several of those features did come back, in acceptable ways, in our case).
> But Linux repesents a tiny portion of the gaming community, so that approach would make no sense at all for a GPU vendor. C'mon.
The growth market for GPUs is GPGPU and servers. And Linux represents a large portion of the programming and server communities.
More to the point, as soon as you support Linux at all then it doesn't matter who has more share, it's still less work to do the above than have to maintain another translation layer.
But AMD doesn't. GPGPU is already supported on nvidia drivers with their opaque blob. AMD has a more-transparent blob. People who want this to work already have a solution. This kernel change is probably important to some people, but those who simply want to run a GPGPU cluster on linux already have workable solutions.
I agree with you almost entirely, except the part about fixing Linux. If the abstraction that Linux provides isn't suitable for some reason, it probably isn't straightforward to change it because of compatibility with existing code.
That's not so much a concern within the kernel boundary, which is the case that applies here. If you have a compelling reason to redesign an internal API, you "just" have to fix up all the code across the tree that consumes it. Changes are regularly made to the internal VFS interfaces, for example.
It's also often the case that kernel-driver interfaces are extended without breaking compatibility. In those cases, you want to ensure that the extensions are suitable for more than one driver to consume.
The problem isn't that sharing code across platforms is bad, it's that not sharing code within Linux is bad. Airlie is basically saying that if the kernel API and subsystems are somehow inadequate, AMD should improve them directly instead of covering them up with a bunch more code.
> Airlie is basically saying that if the kernel API and subsystems are somehow inadequate, AMD should improve them directly instead of covering them up with a bunch more code.
And you really believe that the maintainers will be accepting a giant patch that changes the API and subsystem completely (though into something better) that has the risk of causing lots of regressions to existing drivers? And you believe that AMD is supposed to fix all the regressions that are caused in drivers by other vendors that this change causes?
Of course not, the maintainers will accept a well thought out series of patches that each make one small logical change towards the better interface.
And yes - who else is supposed to fix all the regressions caused by changes that AMD wants? Volunteers who would rather work on something else? If you want a change, you get to support the regressions - and if AMD's work gets merged, then anyone ELSE who wants to make a change in that page needs to support AMD's regressions.
Hence wanting to make sure that the changes from AMD are manageable and flexible enough to allow further changes.
(without looking at details) The problem is that Windows and Linux expose hardware and drivers in different ways. You can shim things up to make the code work, but you end up with a driver that doesn't look like a Linux driver and doesn't work like a Linux driver and can't easily be maintained by people working in the Linux graphics drivers is going to be a problem.
If the driver doesn't really belong in the Linux kernel source for those reasons, it's better to keep it outside the kernel tree.
The open source developers don't care about invisible code reuse in a closed source driver. HALs across open source codebases do exist too (eg for ZFS) but Linux in particular does not like them.
AMD should move their HAL code into their Windows driver, making it a superset of the Linux driver. AMD would get to reduce driver code duplication and Linux kernel developers don't need to merge the AMD's ugly Linux/Windows HAL.
> AMD should move their HAL code into their Windows driver, making it a superset of the Linux driver.
This might theoretically make sense if the Linux subsystem was very stable over many years. Practice shows that the Windows interfaces are what are a lot more stable over the years and changes in them are communicated for a long time beforehand so that hardware vendors can begin changing their drivers long beforehand.
Regardless of one's thoughts on AMD, I think this is broadly true. Microsoft may do a lot of things poorly, but one thing they are good at (arguably, the only thing they're good at, hell maybe the key to their success, really) is maintaining compatibility and not breaking stuff.
It is a good thing. For the developers of that piece of code (AMD in this case).
However, it is introducing a second API for a very specific subset of hardware into a kernel that is being developed by not just AMD people. Dave Airlie is rightly saying that the second API and hence two different code structures makes the whole DRI infrastructure harder to maintain for everyone else.
And Dave's responsibility is to everyone else, not to AMD.
It is a good thing for the driver writer as they have less difference between their targets.
It is a bad thing for the targets as they implement both the driver functionality and the abstractions required to make the same code work cross platform. The response linked describes the cost of those abstractions to the target (Linux kernel in this case).
Nvidia's proprietary driver breaks upon every new kernel release, which is why they have a shim. Furthermore, Nvidia can't ship their driver in the official Linux kernel due to copyright issues, and they're forced to handle all the maintenance burden of their driver (whereas AMD reaps the benefits of Intel's GPU driver bugfixing, and vice versa, thus lowering both Intel's and AMD's driver costs on Linux).
Besides, Nvidia's been having trouble with their Tegra GPUs on Android, and as a result have been forced to pitch in a bit on Nouveau (the reverse-engineered open-source Nvidia driver). They're still having trouble with their driver situation on mobile, as a result of their unwillingness to play ball with the kernel.
> Nvidia's proprietary driver breaks upon every new kernel release, which is why they have a shim.
ELI5: why does each Linux kernel release break driver code? It can't be THAT hard to just have a stable interface and leave it for long periods of time, e.g. only bumping it on major version bumps in the Kernel?
Because in practice, APIs inside Linux do, in fact, change quite a bit -- and by itself maybe that wouldn't matter so much, but the nvidia driver has an insane amount of surface area on top of it. It's a massive driver. You can imagine then, that breaking it is actually easier than you might think.
There is no rule kernel interfaces can only change on major bumps. In reality, they change quite frequently, as new APIs and drivers are merged in, which requires generalization, refactoring, etc across API boundaries to keep things sane. Kernel developers specifically reject the notion of a "stable ABI" like this because they feel it would tie their hands, and lead them to design APIs and workarounds for things which would otherwise be fundamentally simple if you "just" break some function and its call sites. APIs in Linux tend to organically grow, and die, as they are needed, by this logic.
Why wait 5 years for a "major version bump" to delete an API call, you could just do it today and fix the callers, since they're all right there in the kernel tree? It's far easier and more straightforward to do this than attempting to work around "stable" systems for very long periods of time, which is likely to accumulate cruft.
Because they do not care about out-of-tree code, when an API changes, their obligations are to refactor the code using that API, inside the kernel, and nothing else. That means the person making the change also has to fix all the other drivers, too, even if they don't necessarily maintain them. Out of tree users will have to adapt on their own.
This also explains why they do not want a HAL. When a Linux driver interface changes, the person changing it is responsible for changing everything else and fixing other drivers. That means if AMD wants a large change, it may have to go and touch the Intel driver and refactor it to match the new API. If Intel wants something new, they may have to touch the AMD driver in turn. This, in effect, helps reduce the burden and share responsibilities among the affected people.
They don't want a HAL because a HAL is a massive impediment to exactly that workflow. If Intel wants to improve a DRM/DRI interface in the kernel for their GPUs, they could normally do so and touch all the other drivers. Out with the old, in with the new. But now, they'd have to also wade through like 50,000 lines of AMD abstraction code that no other system, no other driver, uses. It effectively makes life worse for every graphics subsystem maintainer when this happens, except for AMD I guess since they can pawn off some of the work. But if AMD plays by the rules -- Intel fixing their AMDGPU driver when they make a change shouldn't be that unusual, or any more difficult compared any other graphics driver. And likewise -- AMD making a change and having to fix Intel's driver? That's just par for the course.
Obviously Linux isn't perfect here and they do, and have, accepted questionable things in the past, or have rejected seemingly reasonable API changes out of stability fear (while simultaneously not wanting a stable ABI -- which is fair). But the logic is basically something like the above, as to why this is all happening.
AMD's recent strategy has been to try to confine the proprietary stuff to userspace, and to implement an open-source kernel driver that can be used by either the proprietary userspace driver or open-source userspace stack.
The difference between open source and free software is mostly in the political camp the word comes from. Read the OSI definition of open source if you don't believe:
Kernel driver of proprietary AMDGPU-PRO are licensed exactly same way as modules in mainline kernel. Most of them dual-licensed under MIT and GPL so BSD and other projects can use them.
Note that both the amd and the nvidia kernel modules always have been FOSS because of the GPL license. It's just that nvidia provides it by its own ways, not through the official linux branch, and thus doesn't have to respect linux rules nor to document the driver.
I think the idea is that the kernel maintainers can't break the code that the AMD driver relies on and in order to properly do that they need to be able to easily grok all the driver implementations, and abstraction layers make that more difficult.
I don't know anything about linux display architecture, but this point sounds really weird to me from general software engineering perspective. Isn't one of the goals of having drivers in an OS to establish a formalized interface between drivers and kernel, and thus achieve separation of concerns between driver maintainers and kernel maintainers? Requiring that kernel maintainers understand how all drivers work does not sound very scalable.
Linux developers want to be able to share code across drivers for similar devices, and they want to be able to refactor and improve that shared code without worrying about out-of-tree drivers. That strategy has worked well for eg. WiFi drivers where the shared mac80211 subsystem allows a lot of logic to be pulled out of individual NIC drivers, and improvements to mac80211 more or less automatically benefit all participating drivers.
Makes sense, but it doesn't have to be mutually exclusive. It is possible to have a fixed network driver interface, and some common helpers orthogonal to it. This way driver developers could chose whether to benefit from common code or not. I guess if linux developer want to enforce code sharing, this wouldn't work, but I wonder why they would do it. Seems like it just makes life harder for both parties.
I wouldn't say avoided.. A lot of the interfaces are still compatible with prior versions as is the software that runs on it, or we'd be on Kernel v50+ by now. Not all software, but a bit.
That said, the community isn't afraid of breaking changes to push future versions forward.
Short version: AMD as a company is dysfunctional. Perfectly happy to throw away $300Mil on failed acquisition, unwilling to hire sufficient number of competent driver developers.
Where did you get this idea from? Just because a patch set got rejected, we can dismiss the fact that AMD had been providing excellent open source driver support for years, and just call them incompetent?
AMD is prioritizing the business here, which makes perfect sense. Why would they spend more money than necessary to appease the Linux maintainers in order to serve the tiny population of Linux gamers who probably don't give a crap about how the kernel is maintained? Their business is on Windows.
Neither AMD nor Nvidia develop Linux drivers for Linux gamers. Not that they don't pay attention to us with fixes and optimizations here and there, but a good part of that is games simply being applications that make use of a lot of driver functionality that might not receive enough testing otherwise.
1. https://en.wikipedia.org/wiki/Direct_Rendering_Manager