Hacker News new | ask | show | jobs
by latchkey 381 days ago
AMD is consistently stacking more HBM.

  H100 80GB HBM3
  H200 141GB HBM3e
  B200 192GB HBM3e

  MI300x 192GB HBM3
  MI325x 256GB HBM3e
  MI355x 288GB HBM3e
This means that you can fit larger and larger models into a single node, without having to go out over the network. The memory bandwidth on AMD is also quite good.
3 comments

It really does not matter how much memory AMD has if the drivers and firmware are unstable. To give one example from last year:

https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...

They are currently developing their own drivers for AMD hardware because of the headaches that they had with ROCm.

"driver" is such a generic word. tinygrad works on mi300x. If you want to use it, you can. Negates your point.

Additionally, ROCm is a giant collection of a whole bunch of libraries. Certainly there are issues, as with any large collection of software, but the critical thing is whether or not AMD is responsive towards getting things fixed.

In the past, it was a huge issue, AMD would routinely ignore developers and bugs would never get fixed. But, after that SA article, Lisa lit a fire under Anush's butt and he's taking ownership. It is a major shift in the entire culture at the company. They are extremely responsive and getting things fixed. You can literally tweet your GH issue to him and someone will respond.

What is true a year ago isn't today. If you're paying attention like I am, and experiencing it first hand, things are changing, fast.

I have been hearing this about AMD/ATI drivers for decades. Every year, someone says that it is fixed, only for new evidence to come out that they are not. I have no reason to believe it is fixed given the history.

Here is evidence to the contrary: If ROCm actually was in good shape, tinygrad would use it instead of developing their own driver.

You're conflating two different things.

ROCm isn't part of AMD drivers, its a software library that helps you support legacy compute APIs and stuff in the BLAS/GEMM/LAPACK end of things.

The part of ROCm you're interested in is HIP; HIP is the part that does legacy CUDA emulation. HIP will never be complete because Nvidia keeps adding new things, documents things wrong, and also the "cool" stuff people do on Nvidia cards aren't CUDA and it is out of scope for HIP to emulate PTX (since that is strongly tied to how historical Nvidia architectures worked, and would be entirely inappropriate for AMD architectures).

The whole thing with Tinygrad's "driver" isn't a driver at all, its the infrastructure to handle card to card ccNUMA on PCI-E-based systems, which AMD does not support: if you want that, you buy into the big boy systems that have GPUs that communicate using Infinity Fabric (which it, itself, is the HyperTransport protocol over PCI-E PHY instead of over HyperTransport PHY; PCI over PCI-E has no ability to handle ccNUMA meaningfully).

Extremely few customers, AMD's or not, want to share VRAM directly over PCI-E across GPUs since most PCI-E GPU customers are single GPU. Customers that have massive multi-GPU deployments have bought into the ecosystem of their preferred vendor (ie, Nvidia's Mellanox-powered fabrics, or AMD's wall-to-wall Infinity Fabric).

That said, AMD does want to support it if they can, and Tinygrad isn't interested in waiting for an engineer at AMD to add it, so they're pushing ahead and adding it themselves.

Also, part of Tinygrad's problem is they want it available from ROCm/HIP instead of a standards compliant modern API. ROCm/HIP still has not been ported to the modern shader compiler that the AMD driver uses (ie, the one you use for OpenGL, Vulkan, and Direct family APIs), since it originally came from an unrelated engineering team that isn't part of the driver team.

The big push in AMD currently is to unify efforts so that ROCm/HIP is massively simplified and all the redundant parts are axed, so it is purely a SPIR-V code generator or similar. This would probably help projects like Tinygrad someday, but not today.

> ROCm isn't part of AMD drivers, its a software library that helps you support legacy compute APIs and stuff in the BLAS/GEMM/LAPACK end of things.

AMD says otherwise:

> AMD ROCm™ is an open software stack including drivers, development tools, and APIs that enable GPU programming from low-level kernel to end-user applications.

https://www.amd.com/en/products/software/rocm.html

The issues involving AMD hardware not only applied to the drivers, but to the firmware below the drivers:

https://www.tomshardware.com/pc-components/gpus/amds-lisa-su...

Tinygrad’s software looks like a userland driver:

https://github.com/tinygrad/tinygrad/blob/master/tinygrad/ru...

It loads various firmware blobs, manages part of the initialization process, manages memory, writes to registers, etcetera. These are all things a driver does.

AMD is extremely bad at communications. The driver already contains everything ROCm requires to talk to the GPU, and ROCm itself is only a SDK that contains runtimes, libraries, and compilers.

This part of TinyGrad is not a driver, however it tries to hijack the process to do part of that task. You cannot boot the system with this, and it does not replace any part of the Mesa/DRI/DRM/KMS/etc stack. It does reinitialize the hardware with a different firmware, which might be why you think this is a driver.

https://community.amd.com/t5/ai/what-s-new-in-amd-rocm-6-4-b...

ROCm 6.4 software introduces the Instinct GPU Driver, a modular driver architecture that separates the kernel driver from ROCm user space.

We have all been hearing things for decades. Things are noticeably different now. Live in the present, not in the past.

Tinygrad isn’t a driver. It is a framework. It is being developed by George however he wants. If he wants to build something that gives him more direct control over things. Fine. Others might write PTX instead if using higher level abstractions.

Fact is that tinygrad runs not only on AMD, but also Nvidia and others. You might want to reassess your beliefs because you’re reading into things and coming up with the wrong conclusions.

I read tinygrad’s website:

https://tinygrad.org/#tinygrad

Under driver quality for AMD, they say “developing” and point to their git repository. If AMD had fixed the issues, they would instead say the driver quality is great and get more sales.

They can still get sales even if they are honest about the state of AMD hardware, since they sell Nvidia hardware too, while your company would risk 0 sales if you say anything other than “everything is fine”, since your business is based on leasing AMD GPUs:

https://hotaisle.xyz/pricing/

Given your enormous conflict of interest, I will listen to what George Hotz and others are saying over what you say on this matter.

Exactly, it is not a driver.

Appreciate you diving more into my business. Yes, we are one of the few that publishes transparent pricing.

When we started, we got zero sales, for a long time. Nobody knew if these things performed or not. So we donated hardware and people like ChipsAndCheese started to benchmark and write blog posts.

We knew the hardware was good, but the software sucked. 16 or so months later, things have changed and sufficiently improved that now we are at capacity. My deep involvement in this business is exactly how I know what’s going on.

Yes, I have a business to run, but at the same time, I was willing to take the risk, when no-one else would, and deploy this compute. To insinuate that I have some sort of conflict of interest is unfair, especially without knowing the full story.

At this juncture, I don’t know what point you’re trying to make. We agree the software sucked. Tinygrad now runs on mi300x. Whatever George’s motivations were a year ago are no longer true today.

If you feel rocm sucks so badly, go the tinygrad route. Same if you don’t want to be tied to cuda. Choice is a good thing. At the end of the day though, this isn’t a reflection on the hardware at all.

That was last year Mi300x firmware and software have gotten much better since then
Unfortunately, AMD and ATI before it have had driver quality issues for decades; and both they and their fans have claimed that they have solved the problems every year since.

Even if they have made progress, I doubt that they have reached parity with Nvidia. I have had enough false hope from them that I am convinced that the only way that they will ever improve their drivers if they let another group write the drivers for them.

Coincidentally, Valve has been developing the Vulkan driver used by SteamOS and other Linux distributions, which is how SteamOS is so much better than Windows. If AMD could get someone else to work on improving their GPGPU support, we would likely see it become quite good too. Until then, I have very low expectations.

Gaming != HPC
GPUs were originally designed for gaming. Their ability to be used in HPC grew out of that. The history of issues goes back rather far.
Thanks, I totally had no idea what the G stood for. /s

Seriously though, you’re clearly stuck in the past. This is tech. It evolves fast.

Holding onto grudges just slows you down.

So the MI300x has 8 different memory domains, and although you can treat it as one flat memory space, if you want to reach their advertised peak memory bandwidth you have to work with it like an 8-socket board.
MI355X isn't out yet, and the upcoming B300 also has 288GB HBM3e
June 12th.

B300 is Q4 2025.

Yes, they keep leapfrogging each other. AMD is still ahead in vram.