| A couple of thoughts here. * AMD's traditional target market for its GPUs has been HPC as opposed to deep learning/"AI" customers. For example, look at the supercomputers at the national labs. AMD has won quite a few high profile bids with the national labs in recent years: - Frontier (deployment begun in 2021) (https://en.wikipedia.org/wiki/Frontier_(supercomputer)) - used at Oak Ridge for modeling nuclear reactors, materials science, biology, etc. - El Capitan (2023) (https://en.wikipedia.org/wiki/El_Capitan_(supercomputer)) - Livermore national lab AMD GPUs are pretty well represented on the TOP500 list (https://top500.org/lists/top500/list/2024/06/), which tends to feature computers used by major national-level labs for scientific research. AMD CPUs are even moreso represented. * HPC tends to focus exclusively on FP64 computation, since rounding errors in that kind of use-case are a much bigger deal than in DL (see for example https://hal.science/hal-02486753/document). NVIDIA innovations like TensorFloat, mixed precision, custom silicon (e.g., the "transformer engine") are of limited interest to HPC customers. It's no surprise that AMD didn't pursue similar R&D, given who they were selling GPUs to. * People tend to forget that less than a decade ago, AMD as a company had a few quarters of cash left before the company would've been bankrupt. When Lisa Su took over as CEO in 2014, AMD market share for all CPUs was 23.4% (even lower in the more lucrative datacenter market). This would bottom out at 17.8% in 2016 (https://www.trefis.com/data/companies/AMD,.INTC/no-login-req...). AMD's "Zen moment" didn't arrive until March 2017. And it wasn't until Zen 2 (July 2019), that major datacenter customers began to adopt AMD CPUs again. * In interviews with key AMD figures like Mark Papermaster and Forrest Norrod, they've mentioned how in the years leading up to the Zen release, all other R&D was slashed to the bone. You can see (https://www.statista.com/statistics/267873/amds-expenditure-...) that AMD R&D spending didn't surpass its previous peak (on a nominal dollar, not even inflation-adjusted, basis) until 2020. There was barely enough money to fund the CPUs that would stop the company from going bankrupt, much less fund GPU hardware and software development. * By the time AMD could afford to spend on GPU development, CUDA was the entrenched leader. CUDA was first released in 2003(!), ROCm not until 2016. AMD is playing from behind, and had to make various concessions. The ROCm API is designed around CUDA API verbs/nouns. AMD funded ZLUDA, intended to be a "translation layer" so that CUDA programs can run as a drop-in on ROCm. * There's a chicken-and-egg problem here. 1) There's only one major cloud (Azure) that has ready access to AMD's datacenter-grade GPUs (the Instinct series). 2) I suspect a substantial portion of their datacenter revenue still comes from traditional HPC customers, who have no need for the ROCm stack. 3) The lack of a ROCm developer ecosystem means that development and bug fixes come much slower than they would for CUDA. For example, the mainline TensorFlow release was broken on ROCm for a while (you had to install the nightly release). 4) But, things are improving (slowly). ROCm 6 works substantially better than ROCm 5 did for me. PyTorch and TensorFlow benchmark suites will run. Trust me, I share the frustration around the semi-broken state that ROCm is in for deep learning applications. As an owner of various NVIDIA GPUs (from consumer laptop/desktop cards to datacenter accelerators), in 90% of cases things just work on CUDA. On ROCm, as of today it definitely doesn't "just work". I put together a guide for Framework laptop owners to get ROCm working on the AMD GPU that ships as an optional add-in (https://community.frame.work/t/installing-rocm-hiplib-on-ubu...). This took a lot of head banging, and the parsing of obscure blogs and Github issues. TL;DR, if you consider where AMD GPUs were just a few years ago, things are much better now. But, it still takes too much effort for the average developer to get started on ROCm today. |
AMD's strategy looks a lot like IBM's mainframe strategy of the 80s. And that didn't go well.