by jgord 825 days ago
I don't understand this - aren't almost all ML NN models built in PyTorch, and aren't these compiled / JIT'd into a lower-level format - and can we not have various backends/drivers for that, such as CUDA / ROCm / VNNI?

The article is unsatisfying because it doesn't explain WHY CUDA reigns supreme.

One hypothesis put forward is that the main alternative, ROCm, is just not very complete and not very fast - that's a good argument.

Another hypothesis that is not considered is: CUDA reigns supreme because NVIDIA GPUs reign supreme.

But people don't write CUDA code... they write PyTorch code?!

6 comments

Nobody else seems to be willing to invest serious funding, including market rates for SWEs, into compelling alternatives. I believe AMD's TC for senior software engineers tops out at 200k in the Bay Area.

The problems you generally experience are:

  * Inexplicably poor performance
  * Poor (and sometimes incorrect) documentation
  * Difficulties debugging
  * Crashes and hangs

Why is this? AI is going to be a multi-trillion-dollar market. I can't think of anything else bigger except maybe electricity, real estate, and food.

If I'm AMD, I'd spend at least $1 billion/year figuring out the software side.

I can't think of an easier way for AMD to return value to shareholders than eroding CUDA's advantage.

Heck, Meta invested something like $100b on VR so far and VR is not nearly the market that AI is.

No device support...

I started playing around with porting some CUDA code to ROCm/HIP on a Ryzen laptop APU I had. While it was an "unsupported" configuration (which was understood), it all worked until AMD suddenly and explicitly blocked the ability to run on APUs. Currently the only way to get back to work on that project on that particular computer would be to run a closed-source patched driver from some rando on the internet. Needless to say, I lost interest.

Last I checked, there were only 7 consumer SKUs that could run AMD's current compute stack, the oldest being 1 generation old. Even among the enterprise hardware they only support ~2 generations back. So you can't even grab some old cheap recycled gear on eBay to hack on their ecosystem.

Meanwhile, I can pull anything with an NVIDIA logo on it from a junkyard and it'll happily run CUDA code that I wrote for the 8800GTX 15+ years ago.

I'm an AI compiler engineer and AMD's hiring process was... non-competitive. Companies are hiring left and right at a fast clip and here's AMD wanting you to fly out in a month. I love their CPUs but... come on. You gotta be serious to compete.
You really mean TC and not base salary? That’s shockingly bad.
It's also not correct. I wouldn't consider myself a "Senior" engineer, but am at AMD and have a TC notably higher than that.
They do write CUDA code, oh boy do they ever. PyTorch is just a coordinator for CUDA or sometimes Metal kernels. New AI architectures and algorithms often end up needing a new or tuned kernel. Look at Flash Attention for an example of one of those that had a big impact.
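To make the "coordinator" point concrete, here is a minimal sketch in plain Python of a framework that only routes each op to a per-backend kernel. Every name in it (`register`, `dispatch`, `fused_attention`, the backend strings) is hypothetical and is not PyTorch's actual API; the "kernels" are pure-Python stand-ins for hand-written GPU code. It only illustrates why a new op is usable on a backend only once someone has written a kernel for that backend.

```python
# Hypothetical illustration only: none of these names are PyTorch APIs.
from typing import Callable, Dict, List, Tuple

Kernel = Callable[[List[float], List[float]], List[float]]

# Maps (op name, backend name) -> kernel implementation.
_KERNELS: Dict[Tuple[str, str], Kernel] = {}


def register(op: str, backend: str) -> Callable[[Kernel], Kernel]:
    """Decorator that records a kernel for one (op, backend) pair."""
    def wrap(fn: Kernel) -> Kernel:
        _KERNELS[(op, backend)] = fn
        return fn
    return wrap


@register("add", "cpu")
def add_cpu(a: List[float], b: List[float]) -> List[float]:
    return [x + y for x, y in zip(a, b)]


@register("add", "cuda")
def add_cuda(a: List[float], b: List[float]) -> List[float]:
    # Stand-in for launching a hand-written GPU kernel.
    return [x + y for x, y in zip(a, b)]


@register("fused_attention", "cuda")
def fused_attention_cuda(q: List[float], k: List[float]) -> List[float]:
    # Stand-in for a tuned kernel that so far exists on only one backend.
    return [x * y for x, y in zip(q, k)]


def dispatch(op: str, backend: str, *args: List[float]) -> List[float]:
    """Look up the kernel for (op, backend) and run it, or fail loudly."""
    try:
        kernel = _KERNELS[(op, backend)]
    except KeyError:
        raise NotImplementedError(
            f"no '{op}' kernel for backend '{backend}'"
        ) from None
    return kernel(*args)
```

In this toy, `dispatch("add", "cuda", [1.0], [2.0])` works, while `dispatch("fused_attention", "rocm", [1.0], [2.0])` raises `NotImplementedError` until somebody writes and registers that kernel.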
The tooling around ROCm is not as good (debuggers, profilers etc), and at least in my tangential experience (that is, involving GPGPU computation, but not for ML), custom operations are faster when written in CUDA code than in a high level Python wrapper (or, for that matter, using tools like OpenMP). Just as we write all our actually performance demanding code in C/C++, we write all our performance sensitive GPU code in CUDA (and obviously, performance is the entire point of putting in the effort to write GPU code).
The world of GPU programming is more than just PyTorch, for starters.

Then there is the quality of hardware, debugging tools, IDE support, supported languages (again, it isn't only PyTorch), and libraries.

Yeah, on paper there is; in reality there isn't.
Surprisingly, there exist people doing GPU programming outside of ML. I work in high performance computing and lots of people write CUDA code.
I wonder where Mojo (the new programming language from Chris Lattner's company) fits into all this. Their promise is to be a superset of Python (like C++ was to C) and to resolve all hardware interface issues.

I know it's still in development, but I'm curious whether anyone has played around with it for the kind of needs discussed on this page.

> I don't understand this - aren't almost all ML NN models built in PyTorch, and aren't these compiled / JIT'd into a lower-level format - and can we not have various backends/drivers for that, such as CUDA / ROCm / VNNI?

PyTorch already does. But if you're saying "NN" and "PyTorch", that already means you're outside of the audience for CUDA I'm talking about in the article. My own stuff was usually Bayesian Hierarchical Models, which at least at the time made PyTorch completely useless (that was nearly a decade ago though; maybe that specific use case has improved).

If you've tried to write actually new (or different enough) NNs or entirely different models, PyTorch is too high-level, and sometimes even TF is too. Even aside from that, if you're a maintainer of BLAS or some specific library for sparse MM with very specific distributions that are optimized for it...

Anyway, those are the key cases, but even aside from that, if you've ever tried even with some higher-level libraries to do non-vanilla stuff, nothing works as well as it should. You get random, inscrutable errors that certainly do exist on NVIDIA GPUs/stuff-based-on-CUDA-under-the-hood, but way way fewer. For newer, custom stuff, it's not really that uncommon to get numerical overflows or other completely breaking problems on alternative backends that don't happen on the CPU or CUDA backend, where things work just fine. Or the CUDA backend is just ridiculously faster. If you're doing something annoying, new, and complicated enough, there's no point in taking the aggravation.
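As a rough illustration of how the same math can overflow on one backend and not another, here is a pure-Python sketch of one plausible mechanism: a dot product whose running sum is kept in half precision versus a wider float. This is an assumption chosen for illustration, not a claim about what any particular backend does; `to_half` and both `dot_*` functions are hypothetical helpers, and half precision is emulated with the standard `struct` module's `"e"` format.

```python
import struct
from typing import List


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value.

    struct raises OverflowError when a finite x is too large for the
    format, so that case is mapped to a signed infinity here.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def dot_wide_accumulator(a: List[float], b: List[float]) -> float:
    """Dot product with the running sum kept in a 64-bit float."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_half_accumulator(a: List[float], b: List[float]) -> float:
    """Same dot product, but every intermediate is rounded to half precision."""
    total = 0.0
    for x, y in zip(a, b):
        total = to_half(total + to_half(x * y))
    return total


# Each product (40000) fits in half precision, whose largest finite value
# is 65504, but the running sum does not.
a = [200.0] * 4
b = [200.0] * 4
wide = dot_wide_accumulator(a, b)
narrow = dot_half_accumulator(a, b)
```

Here `wide` is `160000.0` while `narrow` is `inf`: the inputs and the formula are identical, and only the precision of the accumulator differs.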

The people who write the stuff that is used in PyTorch or other libraries definitely write CUDA code (in C++ etc). And then the people who use PyTorch just build on top of that.

I deliberately tried to keep it accessible so that non-technical (or just non-software) audiences could also get an intuition for why CUDA has such strong lock-in. Otherwise, the pushback I've often gotten is "just re-write it" or "it's just software"; if it were that simple, people wouldn't need to be yelling so much at AMD across so many comments. Basically, it comes from people who can't fathom how software technical debt can ever be a thing. Or, if it is, China has infinite money and time anyway.

A high-level analysis would say that Huawei, AMD, and Intel should all easily invest enough to make this work and compete with CUDA to push their hardware platforms. The reality is that decentralized decision-making by users makes it a more expensive, uncertain bet on whether people will adopt it. A bunch of the lower-level, underlying libraries that things are built on are themselves built on CUDA, AND the researchers who do bleeding-edge research still have a huge amount of experience in, and stuff built on, CUDA.