by bartwr 809 days ago
Seems you have not worked with ML workloads, but base your comment on "internet wisdom", or worse, business analysts (I am sorry if that's inaccurate).

On GPUs, ML "just works" (inference and training) and is always an order of magnitude faster than whatever CPU you have. TPUs work very well for some model architectures (the old ones they were designed and optimized for), and on some novel ones they can actually be slower than a CPU (because of gathers and similar operations). This was my experience working on ML as an ML researcher at Google until 2022; maybe it has gotten better, but I doubt it. Older TPUs were OK only for inference of those specific models and useless for training. And for anything new I tried (a fundamental part of research...), the compiler would sometimes just break with an internal error, most of the time just produce terrible, slow code, and bugs filed against it would stay open for years.

A GPU is so much more than a matrix multiplier: it's a fully general, programmable processor, with excellent compilers and, most importantly, low-level access, so you don't need to rely on proprietary compiler engineers (like the TPU ones) and anyone can develop something like Flash Attention. And as a side note: while a Transformer might be mostly matrix multiplication, many other models are not.


>On GPUs, ML "just works"

If you had worked with ML, you'd know that this is not true. It's actually more like the opposite. It also has nothing to do with the chips themselves. Things don't magically work "because GPU", they work because manufacturers spend the time getting their drivers and ecosystems right. That's why, for example, no one is using AMD GPUs for ML, despite them offering more compute per dollar on paper. Getting the software stack to the point of Nvidia/CUDA, where things really do "just work", is an enormous undertaking. And as someone who has been researching ML for more than a decade now, I can tell you Nvidia also didn't get these things right in the beginning. That's the reason why they have no real competition today (and still won't for quite some time).

> That's why for example no one is using AMD GPUs for ML

You're right, they are behind, but saying that nobody is using them is not accurate. AMD HPC clusters are being used [0] and [1] for AI/ML.

The larger issue is that AMD has so far only been building HPC clusters. Now, with the release of the MI300x, we have Azure and Oracle coming online with them. Disclosure: my business is also building an MI300x supercomputer, with the express goal of enabling more access for developers.

[0] https://defensescoop.com/2023/08/23/navys-new-25m-supercompu...

[1] https://arxiv.org/abs/2312.12705

>AMD HPC clusters are being used [0] and [1] for AI/ML.

Funny how you can immediately tell when the business people made these decisions and not the tech people. This is exactly what I would have expected from an organization like the Navy. On paper it does sound great and the Navy bean counters probably loved this. But they are in for a rough awakening.

As far as I can tell, the only rough awakening is that they paid $25m in 2023 for something that costs a fraction of that today, with even better performance.

In a few months, my own cluster will be nearly 2x that size, with better networking, and we aren't spending anywhere near $25m.

Disclosure: building my own supercomputer business around AMD hardware

The best I can say is that my thoughts and prayers go to the ML engineers who will actually have to deal with this. Those companies literally couldn't pay me enough to put up with it. They will likely only attract people who care about the salary and the position instead of getting things done. I've seen it with other colleagues before. These numbers of yours are completely worthless without someone who is willing to put in 5 times the work for the same or worse results.

People choose jobs and tools for a variety of reasons. I don't feel the need to cast judgement on them over it.

The numbers I gave aren't worthless, nor does it take 5x the amount of work. I also don't think that going with a single source of hardware for all of AI is very smart, especially given that there are serious supply shortages from that single vendor. No Fortune 100 company would put all its eggs in one basket, and even if it were 5x the work, it would be worth it.

Probably bartwr is using "GPUs" to mean NVIDIA GPUs. Seeing as nobody uses AMD GPUs for it, that simplification seems OK.

Hey, this is a good comment. I've only toyed with ML stuff, but I've done a lot with GPUs. I hope you can find my "step back" perspective as valuable as I find your up-close one.

My chief mistake in the above comment was using "TPU", as that's Google's branding. I probably should've used "AI focused co-processor". I'm not talking exclusively about Google's foray into the space, especially as I haven't used TPUs.

My list of things to ditch on GPUs doesn't include cores. My point there is that there's a bunch of components that are needed for graphics programming that are entirely pointless for AI workloads, both inside the core's ALU and as larger board components. The hardware components needed for AI seem relatively well understood at this point (though that could change with some new innovation).

Put another way, my point is this: Historically, the high-end GPU market was mostly limited to scientific computing, enthusiast gaming, and some varied professional workloads. Nvidia has long been king here, but with relatively little attempt by others at competition. ML was added to that list in the last decade, but with a few exceptions (Google's TPU), the people who could move into the space haven't. Then ChatGPT happened, investment in AI has gone crazy, and suddenly Nvidia is one of the most valuable companies in the world.

However, the list of companies who have proven they can make all the essential components (in my list in the grandparent) isn't large, but it's also not just Nvidia. Basically every computing device with a screen has some measure of GPU components, and now everyone is paying attention to AI. So I think within a few years Nvidia's market leadership will be challenged, and they certainly won't be the only supplier of top-of-the-line AI co-processors by the end of the decade. Whether first-mover advantage will keep them in first place, time will tell.

ML doesn't just work on GPUs. It's not uncommon to have architectures where GPUs don't really work, we just tend not to use those :)
Also, it's disingenuous to say "there's only 4 things you need to beat NVIDIA" when each of the 4 is an enormous undertaking.

Not to mention that every not-so-serious, inference-heavy ML developer just wants something that works to deliver to the client. That in itself is a semi-moat.

It's been talked to death but non-CUDA implementations have their challenges regardless of use case. That's what first-mover advantage and > 15 years of investment by Nvidia in their overall ecosystem will do for you.

But support for production serving of inference workloads outside of CUDA is universally dismal. This is where I spend most of my time, and compared to CUDA anything else is non-existent or a non-starter unless you're all-in on packaged, API-driven Google/Amazon/etc. tooling utilizing their TPUs (or whatever). It's the most significant vendor/cloud lock-in I think I've ever seen.

Efficient and high-scale serving of inference workloads is THE thing you need to do to serve customers and actually have a chance at ever making any money. It's shocking to me that Nvidia/CUDA has a complete stranglehold on this obvious use case.

A great summary of how unserious NVIDIA's competitors are is how long it took AMD's flagship consumer/retail GPU, the 7900 XT[X], to gain ROCm support.

That's quite literally unacceptable.

For those who don't know - one year after launch.

Meanwhile Nvidia will go as far as to backport Hopper support to CUDA 11.8 so it "just runs" on the day of launch with everything you already have.