by basilgohar 720 days ago
I love this insider view into this interesting point in computing history, especially about AMD. However, I was a little put off by the glorification of nVidia's shady practices and lock-in policies as key to their current leading position. While technically true, I dislike "ends justify the means"-style thinking.

All this as the OP glorifies AMD's engineering and grit-based culture to drive through all those tough missteps and missed opportunities.

To expand on that, I really do feel AMD has a great engineering culture, but they keep falling into the same traps. They do not invest strongly enough in software support or vendor relationships. Neither of these necessitates the more evil monopolistic practices of vendor lock-in and proprietary, non-free (as in libre) software. If they can navigate that without turning evil, they'd be a company for the ages.

And I can't close without mad respect to Dr. Lisa Su for her admirable leadership, itself bookworthy. Also, quick fact, she and Jensen are cousins!

7 comments

On the other hand, AMD was on the brink of bankruptcy and Lisa Su led them out of it and into a triple-digit share price. Most companies with that much debt and that little revenue would have gone bankrupt.

Lest we forget, the Intel IPC advantage over comparable AMD CPUs was due to some shortcuts that exposed major vulnerabilities in Intel CPUs made from ~2011 to 2019. I’d be curious to see how a Spectre and Meltdown-patched Intel CPU fares against its AMD competitor NOW. Some of the performance hits were brutal: 20%+ in some workloads.

Nvidia was pushing AMD out of the GPU market back when GPUs were effectively only used for gaming and while GameWorks was predatory, you can’t really blame them for having the cooler-running, quieter, more energy-efficient GPUs going back to the Maxwell line (GTX 9x0). CUDA didn’t screw AMD until recently… but in 2014, people were picking Nvidia because the GPUs were considerably “better”. AMD had the best bang for buck back then, but you’d have more power consumption and heat output, and the drivers tended to be buggy. The bugs would be fixed, but it really sucked for people trying to play games on release day.

Nvidia was pushing CUDA forward for over a decade before it started getting serious commercial traction. It's not like they blocked anyone else from developing viable GPGPU tech, they were just the only ones pushing it.

For like 8 years their drivers on Linux were a nightmare and AMD could have come in and done better.

> For like 8 years their drivers on Linux were a nightmare and AMD could have come in and done better.

AMD eventually did while Nvidia's drivers remained a nightmare almost until these days. But sure, AMD could have done it sooner.

> AMD eventually did while Nvidia's drivers remained a nightmare

and yet that trillion-dollar valuation built over the last decade is built with customers almost entirely running on those "nightmare" linux drivers, while AMD's linux drivers crash running the sample app on supported hardware+OS, and nobody at AMD cared until finally a tech-bro with a loud enough platform shamed them into fixing it...

... and this is something like AMD's third crack at the apple, and the first three sets of drivers (one of which is literally a Vulkan-branded spec) are just as non-functional today as rocm was a year ago.

(OpenCL, Fusion HSA/AMD APP, Vulkan Compute/SPIR-V... all still broken so badly that Octane called them out for being unable to successfully compile their renderer and for lack of vendor support, so badly that Blender pulled support after years of turbulent and poorly-performing attempts to work with AMD, etc)

Nvidia only cares about a specific market. I.e. it doesn't care about desktop users. That's what I was talking about. So despite their pools of cash, Nvidia is a trash company when it comes to Linux support.
AFAIK a lot of Hollywood visual effects are done on Linux + Nvidia so they probably support that market.
Not really familiar with it, but hopefully they can get unstuck from Nvidia, especially on Linux. Only very recently have things started improving, it seems, and not even through Nvidia's effort but through the outside community working on nova + nvk.
AMD and Apple tried to push OpenCL, but its design, a C-like kernel compiled to the GPU with LLVM and managed by the Khronos consortium, tended to lag behind CUDA in absolute performance, since CUDA was able to track evolutions in GPU design more closely.

Nowadays almost nobody cares about OpenCL.

The feature lag wasn't the problem, the bugs were the problem: the only reliable OpenCL implementation was the one from Nvidia, but this meant it tended to drive people towards Nvidia rather than steal them away.
Also apparently the reason behind Apple's cut with Khronos seems to be related to how OpenCL was managed by them.
"Hey Khronos, can we tweak the OpenCL spec to be even more restrictive and higher-level, then rebrand it under our proprietary 'Metal' architecture so we can license it out to our competitors?"

"...no, but you could expand on OpenCL or Vulkan compute if you wanted. There are other spec stakeholders, we can't give you carte-blanche control, Apple."

"Why do you insist upon mismanaging the industry's APIs? Screw you guys!" <Beginning of mid 2010s "Khronos Drought" at Apple Computers>

The obvious issue with both your points is that NVidia's competitors did do as such. AMD has had workable Linux drivers for many years now and there were numerous alternatives to CUDA pushed.
A common talking point is that CUDA is a formidable moat for Nvidia, but - as someone who has never done AI dev - I'm curious to understand what makes CUDA so sticky. From an outsider perspective it looks like a re-run of DirectX vs. everything else, but AI is not like gaming and end users often don't have to run the model themselves. So it seems like the network effects should be weaker than those for a graphics API.
I don't know how it is nowadays but i remember trying CUDA back when GeForce GTX 280 was still a high end GPU. I didn't do anything fancy, i just tried to write a simple raytracer to get a feel of how it'd work.

The experience was incredibly simple: write C like usual but annotate a few C functions with some extra keywords and compile using a custom frontend/preprocessor/whatever-nvcc-was instead of gcc (i was on Linux - and BTW i heavily contest the notion that Nvidia drivers on Linux were "nightmare", they always worked just fine with both performance and features comparable to their Windows counterparts while ATi/AMD had buggy and broken drivers for years). Again, the experience was very simple, i even just copy/pasted a bunch of existing C code i had and it worked.

Later i tried to use OpenCL which was supposedly the open alternative. That one felt way more primitive and low level, like writing shaders without the shading bits.

In a way, as you wrote, it was kinda like DirectX: that is, CUDA was like using OpenGL 1.1 with its convenient and straightforward C API and OpenCL was like using DirectX 3 with its COM infested execute buffer nonsense.

After that i never really used CUDA (or OpenCL for that matter) but it gave me the impression that Nvidia did put way more effort on developer experience.
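
For readers who have not seen it, the workflow described in the comment above (ordinary C plus a few extra keywords, compiled with nvcc) looks roughly like the sketch below. This is an illustration, not the commenter's code: the kernel name `scale_add`, the file name `scale_add.cu`, and the array sizes are made up for the example, and it assumes a CUDA-capable GPU with the CUDA toolkit installed.

```cuda
// Build (assumed file name): nvcc scale_add.cu -o scale_add
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// __global__ is one of the "extra keywords": it marks a function that
// runs on the GPU and is launched from host code in the same file.
__global__ void scale_add(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers, filled like in any plain C program.
    float *hx = (float *)malloc(bytes);
    float *hy = (float *)malloc(bytes);
    if (!hx || !hy) return 1;
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Device buffers and host-to-device copies.
    float *dx = NULL, *dy = NULL;
    cudaMalloc((void **)&dx, bytes);
    cudaMalloc((void **)&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // The <<<blocks, threads>>> launch syntax is the other CUDA-specific bit.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale_add<<<blocks, threads>>>(n, 3.0f, dx, dy);

    // Check both launch errors and errors raised while the kernel ran.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);  // 3 * 1 + 2, so 5.0 is expected

    cudaFree(dx);
    cudaFree(dy);
    free(hx);
    free(hy);
    return 0;
}
```

Host and device code live in one source file and nvcc splits them at compile time, which is the "single source" convenience mentioned elsewhere in this thread.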

Nvidia have invested a lot in CUDA, and they have C & Fortran bindings for a lot of scientific stuff, apart from all the DL/Gen AI stuff that's super hot right now.

Like, I started using CUDA (through frameworks) over ten years ago, and basically nobody has come up with anything competitive since then.

> Nvidia have invested a lot in CUDA,

This is a significant understatement. For quite some time Jensen has been saying repeatedly that 30% of their R&D spend is on software. With the money-printing machine that is Nvidia if that holds they're going to continue to rocket ahead of competitors in terms of delivering actual solutions.

The "What are you talking about? AMD/Intel runs torch just fine!" crowd clearly haven't seen things like Riva, DeepStream, NeMo, Triton Inference Server/NIM, etc. Meanwhile AMD (ROCm) still struggles with flash attention...

What these hardware-first (only?) companies like AMD don't seem to understand is that people buy solutions, not GPUs. It just so happens that GPUs are the best way to run these kinds of workloads, but if you don't have a holistic and exhaustive overall ecosystem you end up with single-digit market share vs Nvidia at ~90%.

chicken and egg arguments.. good points and not untrue, but look elsewhere in this topic and see extensive anti-trust behavior, questionable license practices, deceptive public statements and deceptive handling of binary blobs. Very much like Intel - excellent tech in certain places, very mob-like business behavior in other places.

"What are you talking about? AMD/Intel runs torch just fine!" refers indirectly to the value of having competition in markets, not jump on the (well-funded,slick) monopoly bandwagon.

Tooling.

Since CUDA 3.0, NVidia has embraced a polyglot stack, with C, C++ and Fortran at the center, and PTX for anyone else.

Followed by changing the CUDA memory model to match that of C++11.

Khronos never cared for Fortran, and only designed SPIR, when it became obvious they were too late to the party.

So not only has CUDA first level tooling for C, C++, Fortran, with IDE integration in Visual Studio and Eclipse, graphical GPU debugger with all the goodies of a modern debugger, it also welcomes any compiler toolchain that wants to target PTX.

Java, Haskell, .NET, Julia, Python JITs, .... there are plenty to choose from, without going through the "compile to OpenCL C99" alternative.

Finally, the myriad of libraries to choose from.

CUDA is not only for AI, by the way.

The real moat of CUDA is that CUDA... works. Simply works out of the box, even on the cheapest GPUs. Unless you want some specific high end stuff, everything will work on the cheapest GPU of a given generation, with the same base tooling.

And because of that, their OpenCL implementation also works better than others. So there's more tooling not just from nvidia using it, because it. just. works.

Compare this with AMD, whose latest framework is a total mess of "will it work on this GPU?", sometimes needing custom wrangling to enable, etc. etc. and it's effectively supported only on the most expensive compute-only cards.

The difference is not just about APIs; CUDA has a single source file model that is dead easy to use whereas last I checked every competitor still had an outdated manual loading process that adds significant friction.
Doesn't SYCL also allow for a single-source-file model these days?
It is supposed to, yes. I was never able to set it up (admittedly I have not tried in a couple of years since I am not working with GPUs anymore) so I don't know how well it holds up.
In the GPU area AMD lost, and will continue to lose to Nvidia, because they don't seem to get a grip on software and drivers. And that does not bode well for their long-time CEO.
Just the first link review you posted reinforces my argument:

"...But we must now talk about the elephant in the room, and that is AMD’s software stack. While it is absolutely night and day from where it was when we tested MI210, ROCm is nowhere near where it needs to be to truly compete with CUDA..."

You're pointing at the sun and saying "see, it is bright!". Nobody is pretending that AMD does not need to fix their software stack.

AMD did not really turn their attention to AI until about Oct of last year. Now that they have, it will take a bit of time to correct the course of the ship, but I know for certain that it is all hands on deck at this point. One sign of this is that we're seeing more frequent and substantial "night and day" improvements to ROCm.

The lifecycle of hardware is years. MI300x is a substantial leap. MI325x is another one. The rest of the hardware roadmap (years out) is extremely impressive. Software has a much shorter lifecycle and can be iterated on more easily. Expect to continue to see improvements here over the coming years.

I mean for gaming workloads AMD GPUs are doing fine in the Xbox, PlayStation, and Steam Deck consoles.
And in PC AMD has 15.6% of the market, compared to Nvidia's 76.4% according to Steam.

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

The more this is blindly repeated the more you know it's bs
She turned the company around and got it on the right path, but in interviews I get the feeling that she might also be responsible for the "Hardware 1st, 2nd, 3rd, 4th.... eh, maybe software can be 5th" culture and AMD's deep denial that it has a problem.

https://news.ycombinator.com/item?id=40790924

That was OK for the CPU turnaround, but on the GPU front it completely shut them out of the first rounds of the AI party and maybe a trillion in market cap.

I'm hopeful and optimistic for AMD but if anything were to make me bearish on their prospects, it'd be this.
Yeah, I really feel like AMD is struggling with the software aspect. Even back when they were ATI and AMD bought them, the ATI drivers were garbage compared to Nvidia (from my PC gaming experience). After a few AMD and ATI cards, I just accepted the Nvidia tax, where my cards are more expensive and on paper worse, but in practice worked better.

I'm really surprised AMD isn't throwing a whole bunch of money on emulating CUDA. If they could "just" make CUDA work on AMD cards, it feels like Nvidia's position would be severely weakened.

Kind of like how Valve invested heavily into Proton and now gaming on Linux is pretty much fine.

I'm not sure emulating CUDA would be legal, you can look at ZLUDA as an example. It was originally funded by AMD, but got cut for what I presume would be legal reasons. ZLUDA does work amazingly well though from my experience!
AMD also doesn't understand that CUDA got big because it worked on cheap consumer cards; once things were working people got interested in expensive specialized cards. Their stack is still focused on the high end only, but there's no ecosystem to support it.
To me, this is the most important point and what AMD is missing from their current strategy. I can take an off the shelf, easy to get 4070 or 4080 and use it with CUDA to learn.

AMD's strategy for people wanting to learn is basically no strategy.

It's always been the software holding them back, and it still is. They need to invest in the ecosystem and not just the things that are easy to justify as a revenue driver.

That is what ROCm and HIP were supposed to be somehow, but even that isn't really CUDA, as in the polyglot programming language environment, with C, C++ and Fortran first, plus others, followed by Python JIT, libraries, IDE, and a GPU graphical debugger.
>However, I was a little put off by the glorification of nVidia's shady practices and lock-in policies as key to their current leading position.

What were their shady practices and lock-in policies?

> However, I was a little put off by the glorification of nVidia's shady practices and lock-in policies as key to their current leading position. While technically true, I dislike "ends justify the means"-style thinking.

Personally, I have no issue with "ends justify the means"-style thinking as a blanket rule, often it's perfectly appropriate.

I would argue it is, in this case, where Nvidia was playing a game by the rules. If there is an issue with how they played, then government should change the rules.

The people in power in the US don't want that though.