by jjoonathan 716 days ago
So I knew that AMD's compute stack was a buggy mess -- nobody starts out wanting to pay more for less, and I had to learn the hard way how big a gap there was between AMD's paper specs and their actual offerings. I also knew that Nvidia had a huge edge at the cutting edge of things, if you need gigashaders or execution reordering or whatever. But ML isn't any of that. The calculations are "just" matrix multiplication, or not far off.

I would have thought AMD could have scrambled to fix their bugs, at least the matmul related ones, scrambled to shore up torch compatibility or whatever was needed for LLM training, and pushed something out the door that might not have been top-of-market but could at least have taken advantage of the opportunity provided by 80% margins from team green. I thought the green moat was maybe a year wide and tens of millions deep (enough for a team to test the bugs, a team to fix the bugs, time to ramp, and time to make it happen). But here we are, multiple years and trillions in market cap delta later, and AMD still seems to be completely non-viable. What happened? Did they go into denial about the bugs? Did they fix the bugs but the industry still doesn't trust them?

1 comment

It's roughly that the AMD tech works reasonably well on HPC and less convincingly on "normal" hardware/systems. So a lot of people inside AMD think the stack is solid, because it works well on their precisely configured dev machines and on the commercially supported clusters.

Other people think it's buggy and useless because that's the experience on some other platforms.

This state of affairs isn't great. It could be worse but it could certainly be much better.