by elorant 829 days ago
CUDA is a big reason for their moat. And that's not something you can build in a couple of years, no matter how much money you throw at it.

Without CUDA you have a chip that runs on-premises with no one outside having a clue how good it is, which is supposedly what Google does. Your only offering is cloud services. As big as that is, corporations will want to build their own datacenters.

2 comments

Sure, CUDA has a lot of highly optimized utilities baked in (cuDNN and the like) and, maybe more importantly, implementors have a lot of experience with it. But afaict everyone is working on their own HAL/compiler and not using CUDA directly to implement the actual models; CUDA sits inside the HAL/framework. You can probably port any of these frameworks to a new hardware platform with a few man-years' worth of work imo, if you can spare the manpower.
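To make the HAL point concrete, here is a toy sketch in plain Python. Every name in it (`Backend`, `VendorABackend`, `VendorBBackend`, `mlp_forward`) is hypothetical and not taken from any real framework; the "backends" are pure-Python stand-ins, not real device code. It only illustrates the shape of the argument: the model is written against an interface, so a port means writing a new backend rather than rewriting the model.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical hardware abstraction layer: the only part that knows the device."""

    @abstractmethod
    def matmul(self, a, b):
        ...

    @abstractmethod
    def relu(self, x):
        ...


class VendorABackend(Backend):
    """Stand-in for an existing backend (imagine it wrapping vendor libraries)."""

    def matmul(self, a, b):
        # zip(*b) iterates over the columns of b
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

    def relu(self, x):
        return [[max(0.0, v) for v in row] for row in x]


class VendorBBackend(Backend):
    """Stand-in for a port to new hardware: same interface, different implementation."""

    def matmul(self, a, b):
        rows, inner, cols = len(a), len(b), len(b[0])
        out = [[0.0] * cols for _ in range(rows)]
        for i in range(rows):
            for k in range(inner):
                for j in range(cols):
                    out[i][j] += a[i][k] * b[k][j]
        return out

    def relu(self, x):
        return [[v if v > 0.0 else 0.0 for v in row] for row in x]


def mlp_forward(backend, x, w1, w2):
    """Model code: written once against Backend, never against a specific device."""
    hidden = backend.relu(backend.matmul(x, w1))
    return backend.matmul(hidden, w2)


x = [[1.0, -2.0]]
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[2.0], [3.0]]
out_a = mlp_forward(VendorABackend(), x, w1, w2)
out_b = mlp_forward(VendorBBackend(), x, w1, w2)
```

`mlp_forward` never changes between the two calls; only the object passed as `backend` does. Real frameworks have far more ops and far harder performance work behind each one, which is where the man-years go.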

I think nobody had the time to port any of these architectures away from CUDA because:

* the leaders want to maintain their lead and everyone needs to catch up asap so no time to waste,
* and progress was _super_ fast so doubly no time to waste,
* there was/is plenty of money that buys some perceived value in maintaining the lead or catching up.

But imo:

1. progress has slowed a bit, maybe there's time to explore alternatives,
2. nvidia GPUs are pretty hard to come by, switching vendors may actually be a competitive advantage (if performance/price pans out and you can actually buy the hardware now as opposed to later).

In terms of ML "compilers"/frameworks, afaik there's:

* Google JAX/Tensorflow XLA/MLIR,
* OpenAI Triton,
* Meta Glow,
* Apple PyTorch+Metal fork.

> CUDA is a big reason for their moat.

Zen 1 showed that absolute performance is not the be-all and end-all metric (Zen lost on single-core performance vs Intel). A lot of people care about the bang-for-buck metric. If AMD can squeak out good-enough drivers for cards with good-enough performance at a TCO[1] significantly lower than Nvidia's, they break Nvidia's current positive feedback cycle.

1. Initial cost and cooling - I imagine for AI data center usage, opex exceeds capex.
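A rough sketch of the bang-for-buck arithmetic, assuming TCO is simply purchase price plus yearly running cost (power, cooling) over the service life. The function names and every number below are made-up placeholders for illustration, not real prices or benchmarks for any card.

```python
def tco(capex, annual_opex, years):
    """Total cost of ownership: upfront cost plus running cost over the service life."""
    return capex + annual_opex * years


def perf_per_dollar(throughput, capex, annual_opex, years):
    """Throughput (arbitrary units) delivered per dollar of TCO."""
    return throughput / tco(capex, annual_opex, years)


# Placeholder figures only: a faster, pricier card vs a slower, cheaper one.
fast_card = perf_per_dollar(throughput=100, capex=10_000, annual_opex=2_000, years=3)
cheap_card = perf_per_dollar(throughput=80, capex=5_000, annual_opex=1_500, years=3)

cheap_card_wins = cheap_card > fast_card
```

With these placeholder inputs the slower card comes out ahead per dollar despite losing on absolute throughput, which is the Zen 1 pattern. Whether that holds for real hardware depends entirely on the actual prices, power draw, and achieved performance.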