| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by sigmoid10 813 days ago
	>On GPUs, ML "just works" If you had worked with ML, you'd know that this is not true. It's actually more like the opposite. It also has nothing to do with the chips themselves. Things don't magically work "because GPU", they work because manufacturers spend the time getting their drivers and ecosystems right. That's why for example noone is using AMD GPUs for ML, despite them offering more compute per dollar on paper. Getting the software stack to the point of Nvidia/CUDA, where things really do "just work", is an enormous undertaking. And as someone who has been researching ML for more than a decade now, I can tell you Nvidia also didn't get these things right in the beginning. That's the reason why they have no real competition today (and still won't for quite some time).

2 comments

latchkey 812 days ago

> That's why for example noone is using AMD GPUs for ML

You're right, they are behind, but to say that nobody is using it, is not truthful. AMD HPC clusters are being used [0] and [1] for AI/ML.

The larger issue is that AMD has only been building HPC clusters for the last period of time. Now, with the release of MI300x, we have Azure and Oracle coming online with them now. Disclosure, my business is also building a MI300x super computer as well, with the express goal of enabling more access to developers.

[0] https://defensescoop.com/2023/08/23/navys-new-25m-supercompu...

[1] https://arxiv.org/abs/2312.12705

link

sigmoid10 811 days ago

>AMD HPC clusters are being used [0] and [1] for AI/ML.

Funny how you can immediately tell when the business people made these decisions and not the tech people. This is exactly what I would have expected from an organization like the Navy. On paper it does sound great and the Navy bean counters probably loved this. But they are in for a rough awakening.

link

latchkey 810 days ago

As far as I can tell, the only rough awakening is that they paid $25m in 2023, that costs a fraction of that today, for even better performance.

In a few months, my own cluster will be nearly 2x that size, with better networking, and we aren't spending anywhere near $25m.

Disclosure: building my own supercomputer business around AMD hardware

link

sigmoid10 800 days ago

The best I can say is that my thoughts and prayers go to the ML engineers who will actually have to deal with this. Those companies literally couldn't pay me enough to put up with it. They will likely only attract people who care about the salary and the position instead of getting things done. I've seen it with other colleagues before. These numbers of yours are completely worthless without someone who is willing to put in 5 times the work for the same or worse results.

link

latchkey 800 days ago

People choose jobs and tools for a variety of reasons. I don't feel the need to cast judgement on them over it.

The numbers I gave aren't worthless, nor does it take 5x the amount of work. I also don't think that going with a single source for hardware for all of AI is very smart either, especially given the fact that there are serious supply shortages from that single vendor. No fortune 100 would put all their eggs in one basket and even if it was 5x the work, it is worth it.

link

mike_hearn 813 days ago

Probably bartwr is using "GPUs" to mean NVIDIA GPUs. Seeing as nobody uses AMD GPUs for it, that simplification seems OK.

link