by georgeecollins 473 days ago
I would really love it if people on Hacker News could weigh in on how much of a moat they think CUDA really is. As in: How hard is it to use something else? If you started a project today how much would you want to get paid to not use CUDA?

A lot of readers on this site have good insight into this, and it is a key question that financial people are asking without the knowledge many people here possess.

6 comments

SemiAnalysis has a nice write-up on MI300X vs H100/H200 and concludes that the CUDA moat is still very real: https://semianalysis.com/2024/12/22/mi300x-vs-h100-vs-h200-b...

"As fast as AMD tries to fill in the CUDA moat, NVIDIA engineers are working overtime to deepen said moat with new features, libraries, and performance updates."

AMD's competitor to CUDA is ROCm. Historically, AMD has been hobbled by the quality of their drivers and because they sold less performant hardware. AMD has traditionally been the budget option for both CPUs and GPUs. Things have changed in the CPU space because of Ryzen, but sadly AMD has not been able to realize an equivalent competitive advantage in the GPU space. Intel has also entered the GPU market, but they are even farther behind than AMD. The same problems I am about to describe apply to them as well, to a higher degree.

Rewriting CUDA programs to run using ROCm is expensive and time consuming. It is difficult to justify this expense when in all likelihood the ROCm version will be less efficient, less performant, and less stable than the original. In the grand scheme of things, AMD hardware is indeed cheaper but it's not that much cheaper. From a business standpoint, it's just not worth it.

Knowing what I know about how management thinks, even if AMD managed to make an objectively superior product at a much better price, institutional momentum alone would keep people on CUDA for a long time.

    AMD has been hobbled by the quality of their drivers 

I always hear this and I believe it, but I've never been able to find any insight about what exactly is holding them back.

Given the way nVidia is printing money, surely it absolutely cannot be a lack of motivation on AMD's part?

This is a very uninformed thought as I have no experience writing drivers, nor am I familiar with the various things supported by CUDA and ROCm. But how is AMD struggling with ROCm compute drivers, when their game drivers have been plenty stable as far as I have experienced? Surely the surface area of functionality needed for the graphics drivers is larger and therefore the compute drivers should be a relatively easier task? Or am I wrong and CUDA has a bunch of higher-level stuff baked into it and this is what AMD struggles to match?

    and because they sold less performant hardware.

Does anybody have any insight into specifically what part of compute performance AMD is struggling to match? Did AMD bet on the wrong architectural horse entirely? Are they unable to implement really basic compute primitives as efficiently as they want because nVidia holds key patents? Did nVidia lock down the entire pool of engineers who can implement this shit in a performant way?

I mean, aside from GPU compute stuff, it sure looks to me like AMD is executing well. It doesn't seem like they're a bunch of dunces over there. Quite the opposite?

Never underestimate the power of institutional momentum! cough IBM AS400

One factor is how close to the bleeding edge one needs to be, and how niche the model or application is. ROCm lags by some years, and application, model, and framework developers test less on it, which can be problematic in niches. For something very established, like image classification, that does not really matter: 3-year-old CNNs will generally do the trick. But if one wants to drop in model X that was put on GitHub/HuggingFace within the last year, one would be buying a lot of trouble.

> could weigh in on how much of a moat they think CUDA really is.

There's movement to implement CUDA libraries that work on non-Nvidia cards, but I guess adoption could be hindered by legal fears.

https://github.com/vosen/ZLUDA

Wrong project. ZLUDA will never support enterprise.

What you're looking for is SCALE...

https://docs.scale-lang.com/

and they are making amazing progress.

Whenever a new AI model gets released and made available to the public, it seems to be NVIDIA only. The last few I've tried always were, I assume because that's what the researchers had at their disposal.

So why give the valuable knowledge away for free?

It's the hacker philosophy, isn't it?