Hacker News new | ask | show | jobs
by jmward01 807 days ago
Yeah, this has stopped me from trying anything with them. They need to lead with their consumer cards so that developers can test/build/evaluate/gain trust locally and then their enterprise offerings need to 100% guarantee that the stuff developers worked on will work in the data center. I keep hoping to see this but every time I look it isn't there. There is way more support for apple silicon out there than ROCm and that has no path to enterprise. AMD is missing the boat.
2 comments

In fairness it wasn't Apple who implemented the non-mac uses of their hardware.

AMD's driver is in your kernel, all the userspace is on GitHub. The ISA is documented. It's entirely possible to treat the ASICs as mass market subsidized floating point machines and run your own code on them.

Modulo firmware. I'm vaguely on the path to working out what's going on there. Changing that without talking to the hardware guys in real time might be rather difficult even with the code available though.

You are ignoring that AMD doesn't use an intermediate representation and every ROCm driver is basically compiling to a GPU specific ISA. It wouldn't surprise me that there are bugs they have fixed for one ISA that they didn't bother porting to the others. The other problem is that most likely their firmware contains classic C bugs like buffer overflows, undefined behaviour, or stuff like deadlocks.
This is sort of true. Graphics compiles to spir-v, moves that around as deployment, then runs it through llvm to create the compiled shaders. Compute doesn't bother with spir-v (to the distress of some of our engineers) and moves llvm IR around instead. That goes through the llvm backend which does mostly the same stuff for each target machine. There probably are some bugs that were fixed on one machine and accidentally missed on another - the compiler is quite branchy - but it's nothing like as bad as a separate codebase per ISA. Nvidia has a specific ISA per card too, they just expose PTX and SASS as abstractions over it.

I haven't found the firmware source code yet - digging through confluence and perforce tries my patience and I'm supposed to be working on llvm - but I hear it's written in assembly, where one of the hurdles to open sourcing it is the assembler is proprietary. I suspect there's some common information shared with the hardware description language (tcl and verilog or whatever they're using). To the extent that turns out to be true, it'll be immune to C style undefined behaviour, but I wouldn't bet on it being free from buffer overflows.

You are right, AMD should do more with consumer cards, but I understand why they aren't today. It is a big ship, they've really only started changing course as of last Oct/Nov, before the release of MI300x in Dec. If you have limited resources and a whole culture to change, you have to give them time to fix that.

That said, if you're on the inside, like I am, and you talk to people at AMD (just got off two separate back to back calls with them), rest assured, they are dedicated to making this stuff work.

Part of that is to build a developer flywheel by making their top end hardware available to end users. That's where my company Hot Aisle comes into play. Something that wasn't available before outside of the HPC markets, is now going to be made available.

I look forward to seeing it. NVIDIA needs real competition for their own benefit if not the market as a whole. I want a richer ecosystem where Intel, AMD, NVIDIA and other players all join in with the winner being the consumer. From a selfish point of view I also want to do more home experimentation. LLMs are so new that you can make breakthroughs without a huge team but it really helps to have hardware to make it easier to play with ideas. Consumer card memory limitations are hurting that right now.
> I want a richer ecosystem where Intel, AMD, NVIDIA and other players all join in with the winner being the consumer.

This is exactly the void I'm trying to fill.

I can't find your company, what are you trying to do to solve this?
Thanks for asking, it is right there in my profile... click on my username.