by bsder 825 days ago
CUDA is a moat because AMD and Intel are run by morons^W^W^W run by people who can't swallow the fact that software is more important than hardware.

Intel should be shoveling out 16GB Arc graphics cards for free to every graduate program in the country who can fill out a web form. In a couple years, they'd displace NVIDIA.

AMD needs to be funding a CUDA shim that allows people to port stuff directly to their cards. And they need to NOT be segmenting the consumer and professional cards software ecosystems.

Yes, there has been progress. However, when you look at the amount of money that AMD and Intel throw at software vs how much NVIDIA throws at software, it's an instant facepalm moment.

NVIDIA is 100% vulnerable--if it weren't for the fact that their competitors are idiots.

4 comments

>NVIDIA is 100% vulnerable--if it weren't for the fact that their competitors are idiots.

I think Nvidia sees it too. That's why they're moving upstream by providing the entire stack: CUDA, GPUs, interconnect chips, networking chips, racks, OS, software, models.

I think the "CUDA moat" people like OP are underselling Nvidia. They're positioning themselves as the full-stack AI provider. Forget CUDA.

The moat that CUDA currently provides is what gives Nvidia the room to move up. CUDA is a stepping stone - something stable they can rely on to cement an even higher position (tell me that the full stack is less vendor locked-in than CUDA, and I'll have a bridge to sell you).
By the time some companies figure out how to replace CUDA, Nvidia has already moved well above the CUDA layer.
How hard do you think it is to find engineers who are

- Great at legacy C++ code.

- Great at new C++ code.

- Great at embedded/high performance/distributed code.

- Experts in linear algebra and calculus.

- Competent at machine learning and similar problems.

Now imagine that after you find ~10-50 competent senior engineers who can each segment and train 1-5 engineers, you also need to hire 10-20 managers, PMs and directors who are smart enough to do more than "copy NVidia's offering from last year", and wise enough to still build a 1:1 compatibility layer.

Apple is likely seeing more traction on their Metal API by virtue of it being reasonably well guaranteed to be around in ~5 years, and being common on multiple device platforms that students/devs use or customers deploy.

All this describes video game programmers to me (well, 4/5 at least). Given that there have been thousands of game layoffs recently, anyone looking to build their AI teams should be diving through LinkedIn looking for laid-off video game programmers.
My understanding of the games industry is that a small fraction of game programmers are working on core game engines and low level graphics kernels. Is that inaccurate?
Even people working on the higher level stuff have more exposure to matrix multiplication than people working on CRUD apps.
We're talking about a trillion dollars of market cap here. If the difficulty is in hiring up to ~70 people with somewhat uncommon but not obscure skills, perhaps the executives should be revisited.
It's kind of hilarious to be saying that Apple is more likely to be seeing traction on Metal of all things, when all but the last one of those requirements fit graphics programmers in Vulkan or DirectX, both of which have far more traction than Metal, and that last requirement is pretty easy to pick up if you're an expert in linear algebra and calculus.

It gets even stranger when considering that as major GPU makers, both AMD and Intel have lots of access to such talent.

Vulkan only has traction on Android, and a couple of Linux titles.

Metal has 20% of the desktop market, and the whole of the iOS/iPad/watchOS markets combined.

Even with Android market share, many folks keep using OpenGL ES, because Vulkan tooling on Android sucks and isn't available to Java/Kotlin developers like OpenGL ES is, so only game engines like Godot/Unreal/Unity make use of Vulkan in practice.

CUDA is a shallow moat whose effectiveness depends entirely on NVidia convincing people to be mortally fearful of water.
Genuine questions. What are your use cases? What do you do? How much experience?

My personal experience shows CUDA to in fact be a very deep moat. In ~12 years of CUDA and ~6 of ROCm (since Vega) I’ve never met a professional who says otherwise, including those at top500.org AMD sites.

From what I’ve seen online this take really seems to come from some kind of Linux desktop Nvidia grudge/bad experience or just good ‘ol gaming/desktop team red vs green vs blue nonsense.

Many things can be said about Nvidia and all kinds of things can be debated but suggesting that Nvidia has > 90% market share simply and solely because people drink Nvidia kool-aid is a wild take.

I have 40+ yrs of HPC/AI apps/performance engineering experience & I was one of the 1st people to port LAPACK and a number of other numerical libs to CUDA. Moreover, many of those major DoE + AI sites are my customers.

You should not confuse AMD's general & long-standing indifference/incompetence wrt SW with the actual difficulty of providing a portable SW path for acceleration. As Woody Allen once said: "90% of success is showing up"

But what happened in AI, when, in a very short period of time, almost everyone moved away from writing their code directly in CUDA to writing it in frameworks like TensorFlow & PyTorch, is all the evidence anyone needs of just how unsound that SW obstacle is.

I'm working on a project ATM at one of the DoE sites you're likely referring to... Maybe we'll bump into each other!

Ah yes, pytorch:

1) Check issues, PRs, etc. on the torch GitHub. Relative to its market share, ROCm has a multiple of the number of open and closed issues. There is still much work to be done for things as basic as overall stability.

2) torch is the bare minimum. Consider flash attention. On CUDA it just runs, of course, with sliding window attention, ALiBi, and PagedAttention. The ROCm fork? Nope. Then check out the xFormers situation on ROCm. Prepare to spend your time messing around with ROCm, spelunking GH issues/PRs/blogs, etc. and going one by one through frameworks and libraries instead of `pip install` and actually doing your work.

3) Repeat for hundreds of libraries, frameworks, etc depending on your specific use case(s).

Then, once you have a model and need to serve it up for inference so your users can actually make use of it and you can get paid? With CUDA you can choose between torchserve, HF TEI/TGI, Nvidia Triton Inference Server, vLLM, and a number of others. On ROCm, vLLM has what I would call (at best) "early" support that requires patches to ROCm, isn't feature complete, and regularly has commits to fix yet another show-stopping bug/crash/performance regression/whatever.

Torch support is a good start but it's just that - a start.

I almost spewed my coffee when reading your grandparent comment.

One of the first teams that ported LAPACK to CUDA, as CULA, is apparently being paid handsomely by Nvidia [1],[2].

Interestingly, DCompute is a little-known effort to support compute on CUDA and OpenCL in the D language, and it was done by a part-time undergrad student [3].

I strongly believe we need a very capable language to make advancement much easier in HPC/AI/etc, and the D language fits the bill very much and then some. Heck, it even beat the BLAS libraries that other so-called data languages, namely Matlab and Julia, still heavily depend on for their performance to this very day. It did so in style back in 2016, more than seven years ago [4]. The DCompute implementation by the part-timer in 2017 actually depended on this native D implementation of these linear algebra routines in Mir [5].

[1] CULA: hybrid GPU accelerated linear algebra routines:

https://www.spiedigitallibrary.org/conference-proceedings-of...

[2] CUDA Spotlight: John Humphrey:

https://www.nvidia.com/content/cuda/spotlights/john-humphrey...

[3] DCompute: GPGPU with Native D for OpenCL and CUDA:

https://dlang.org/blog/2017/07/17/dcompute-gpgpu-with-native...

[4] Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen:

http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...

[5] DCompute: Native execution of D on GPUs and other Accelerators:

https://github.com/libmir/dcompute

I got paid to do the LAPACK port, back in the mid 2000s, for a federal contractor working on satellite imaging type apps. I was still a good coder, back then... Took me about a month, as I recall. Maybe 6 weeks.

But I'm one of those old-school HPC guys who believes that libraries are mostly irrelevant, and absolutely no substitute for compilers and targeted code generation.

Julia is cool, btw. It could very well end up supplanting Fortran, once they fix the poor performance code generation issues.

I think you are right on the libraries. That's why there's currently an initiative in the D ecosystem to make the D compiler, DMD, available as a library, and the aim is probably for the compiler to be the only thing needed to run the library, without extra code [1].

I really wish some modern language would try to supplant Fortran for HPC, and personally my bet is on D.

[1] DMD Compiler as a Library: A Call to Arms:

https://news.ycombinator.com/item?id=39465838

Its effectiveness depends on there being nothing on the other side.
Mostly true, and you'll get no argument from me on the AMD & Intel are fuckwits front. Intel does ok, but AMD in particular has completely dropped the ball on the SW front, and has been doing so for at least 25 yrs.

The point I was glibly trying to get across was that even a small effort on the part of AMD to treat the SW side as seriously as NVidia does would have yielded great benefits, and not have left them so far behind.

Also, there is a lot of work going on in the gcc & llvm toolchain to not only use OpenMP to target accelerators in computationally intensive loops but, in the case of llvm, to also target tensor instructions for more efficient code generation (https://lists.llvm.org/pipermail/llvm-dev/2021-November/1537...).
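For anyone who hasn't seen what that looks like in practice, here is a minimal sketch of an OpenMP `target` loop, assuming a compiler built with offloading support for your accelerator. The `saxpy` function name is just an illustration, not something from the linked thread. If the compiler has no OpenMP support enabled, the pragma is ignored and the loop simply runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = a * x + y element-wise.
// With OpenMP offloading enabled, the loop is mapped onto the device;
// without it, the pragma is ignored and this is an ordinary host loop.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const std::size_t n = x.size();

    // map(to: ...) copies x to the device; map(tofrom: ...) copies y
    // to the device before the loop and back to the host afterwards.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The point is that nothing in the source names a vendor: which accelerator it lands on is decided by how the toolchain is configured, not by the code.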

It took the AI folk less than 18 months to almost completely move away from CUDA to Tensorflow and then PyTorch... LLVM, imho, is going to do the same for Sci/Eng and general code bases in the next 2 years.

Nvidia are large contributors to these llvm changes, so again I don't see how this is going to hurt them.
Never said it would hurt NVidia, only that CUDA itself isn't as strong a barrier as people seem to think it is.

But with GPU target support in LLVM, in most cases you won't need to resort to CUDA anymore.

> AMD needs to be funding a CUDA shim that allows people to port stuff directly to their cards. And they need to NOT be segmenting the consumer and professional cards software ecosystems.

Isn't that what HIPIFY does? https://github.com/ROCm/HIPIFY
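To give a rough idea of the kind of source-to-source translation involved, here is a toy sketch, not HIPIFY's actual implementation. The `toy_hipify` function is hypothetical, and it only does naive text substitution for a handful of runtime calls whose HIP names mirror the CUDA ones; the real tools cover far more of the API and handle many cases that plain substitution cannot.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy illustration only: rewrites a few CUDA runtime API names to their
// HIP equivalents by plain substring replacement. The name toy_hipify is
// hypothetical and this is not how the real HIPIFY tools are built.
std::string toy_hipify(std::string src) {
    static const std::vector<std::pair<std::string, std::string>> renames = {
        {"cudaMalloc", "hipMalloc"},
        {"cudaMemcpy", "hipMemcpy"},
        {"cudaFree", "hipFree"},
        {"cudaDeviceSynchronize", "hipDeviceSynchronize"},
    };
    for (const auto& r : renames) {
        std::size_t pos = 0;
        // Replace every occurrence, resuming the search after each replacement.
        while ((pos = src.find(r.first, pos)) != std::string::npos) {
            src.replace(pos, r.first.size(), r.second);
            pos += r.second.size();
        }
    }
    return src;
}
```

For example, `toy_hipify("cudaMalloc(&p, n); cudaFree(p);")` gives back `hipMalloc(&p, n); hipFree(p);`. Renaming calls is the easy part; a feature with no counterpart on the other side cannot be translated this way.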

The problem is that despite years of work and despite all the marketing hype, it’s still missing basic features that are 10+ years old on the Nvidia side. If you can’t do dynamic parallelism then kernels can never launch kernels, for example. It has “partial support” for texture unit access. Inter-process communication is not supported. Etc.

https://rocm.docs.amd.com/projects/HIP/en/latest/user_guide/...