by the_king 539 days ago
> It’s not just that it’s immature software, they need to change how they do development.

I remember geohot saying something similar about a year ago

4 comments

I expect everyone has been saying it for a while, the calls are just getting more strident and public as it becomes clear that AMD's failures are strategic rather than tactical. And as people try to build business on their half-hearted attempts.

I still think it is a mistake to say that CUDA is a moat. IMO the problem here is that AMD still doesn't seem to think that GPGPU compute is a thing. They don't seem to understand the idea that someone might want to use their graphics cards to multiply matrices independently of a graphics pipeline. All the features CUDA supports are irrelevant compared to the fact that AMD can't handle GEMM performantly out of the box. In my experience it just can't do it; back in the day my attempts to multiply matrices would crash drivers. That isn't a moat, but it certainly is something spectacular.

If they could manage an engineering process that delivered good GEMM performance then the other stuff can probably get handled. But without it there really is a question of what these cards are for.

I wonder to what extent vulkan compute could be used for this. Of course, it is only an option on their RDNA GPUs since CDNA is not for graphics, even though that is the G in GPU.
There has been some testing within llama.cpp, which supports both Vulkan and ROCM-Blas. When it works, the latter is about 2x faster than the Vulkan version.
Unless it provides the polyglot capabilities of CUDA, and related IDE and graphical debugging capabilities, not really.
Yeah, 80% margins on matrix multiplication should be a puddle not a moat but AMD is more scared of water than the witch that melts in Wizard of Oz so I guess the puddle is a moat after all.
Anyone who looks at the mess that is ROCm and the design choices they made could easily see that.

GPU support lagged behind for years, no support for APUs and no guaranteed forward compatibility were clear signs that as a whole they have no idea what they are doing when it comes to building and shipping a software ecosystem.

To that you can add the long history of both AMD and ATI before they merged releasing dog shit software and then dropping support for it.

On the other hand you can take any CUDA binary even one that dates back to the original Tesla and run it on any modern NVIDIA GPU.

> GPU support lagged behind for years, no support for APUs and no guaranteed forward compatibility were clear signs that as a whole they have no idea what they are doing when it comes to building and shipping a software ecosystem.

This is likely self-inflicted. They decided to make two different architectures: CDNA for HPC and RDNA for graphics. They are reportedly going to rectify this with UDNA in the future, but that is what they really should have done from the start. Nvidia builds one architecture with different chips based on it to accommodate everything, and code written for one easily works on another because it is the same architecture. This is before even considering that they have PTX as an intermediate language, which serves a similar purpose to Java bytecode in allowing write once, run anywhere.

This was happening before CDNA was even a thing.

They didn’t release support even for all GPUs from the same generation and dropped support for GPUs sometime within 6 months of releasing a version that actually “worked”.

The entire core architecture behind ROCM is rotten.

P.S. NVIDIA usually has multiple CUDA feature levels even within a generation. The difference is that a) they always provide a fallback option, which usually doesn’t require any manual intervention, and b) as long as you define the minimum target feature level when you build the binary, it is guaranteed to run on all past hardware that supports the feature level you targeted and on all future hardware.
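
To make the fallback rule concrete, here is a rough Python model of how an image gets picked from a fat binary, as I understand NVIDIA's documented compatibility rules: native code (SASS) runs on devices of the same major compute capability with an equal or higher minor version, and embedded PTX can be JIT-compiled for any device of equal or higher compute capability. `FatBinary` and `select_image` are made-up names for illustration, not a real CUDA API, and the sketch ignores architecture-specific targets such as `sm_90a`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FatBinary:
    """Hypothetical stand-in for a fat binary: native code plus embedded PTX."""
    sass_targets: frozenset  # (major, minor) pairs with precompiled native code
    ptx_targets: frozenset   # (major, minor) pairs whose PTX is embedded


def select_image(binary: FatBinary, device: tuple) -> Optional[str]:
    """Pick what would run on `device`, a (major, minor) compute capability."""
    # Native code: same major version, minor no newer than the device's.
    native = [t for t in binary.sass_targets
              if t[0] == device[0] and t[1] <= device[1]]
    if native:
        return "sass sm_%d%d" % max(native)
    # Fallback: JIT-compile PTX built for any capability at or below the device's.
    ptx = [t for t in binary.ptx_targets if t <= device]
    if ptx:
        return "ptx compute_%d%d (JIT)" % max(ptx)
    # Device is older than everything in the binary: nothing can run.
    return None


# A binary built years ago for compute capability 5.2, with PTX embedded.
old_build = FatBinary(frozenset({(5, 2)}), frozenset({(5, 2)}))
print(select_image(old_build, (5, 2)))  # native code is used
print(select_image(old_build, (9, 0)))  # newer major version: PTX is JIT-compiled
print(select_image(old_build, (3, 5)))  # older device: no compatible image
```

The forward-compatibility guarantee only holds if PTX is actually embedded at build time; a binary that ships native code alone for one major version has nothing to fall back on.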

The differences between CUDA feature levels appear minor according to the PTX documentation:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

They also appear to be cumulative.

It doesn’t matter; the point is that they don’t break stuff. You can still compile CUDA today to work on old hardware, and your binaries are guaranteed to have forward compatibility.

You don’t get that with ROCm, and this is why it’s garbage unless someone else abstracts all of that from you.

So if Microsoft is happy to maintain an ML as a service solution that just takes prompts and maybe data it’s not your problem.

But if you need to run your own workloads, which can include workloads that are well outside of “AI” and might not be possible or remotely profitable to wrap in a SaaS product, it’s all on you.

> On the other hand you can take any CUDA binary even one that dates back to the original Tesla and run it on any modern NVIDIA GPU

This particular difference stems from the fact that NVIDIA has PTX and AMD does not have any such thing. I.e., this kind of backwards compatibility will never be possible on AMD.

Backward compatibility is one thing, but not having forward compatibility is a killer.

Having to create a binary that targets a very specific set of hardware, with no guarantee that it will run on future hardware and in fact a guarantee that it won’t, is what makes ROCM unusable for anything you intend to ship.

What’s worse is that they also drop support for their GPUs faster than Leo drops support for his girlfriends once they reach 25…

So not only do you have to recompile, there is also no guarantee that your code will work with future versions of ROCM or that future versions of ROCM will still produce binaries that are compatible with your older hardware.

Like how is this not the first design goal to address when you are building a CUDA competitor I don’t fucking know.

> Like how is this not the first design goal to address when you are building a CUDA competitor I don’t fucking know.

The words "tech debt" do not have any meaning at AMD. No one understands why this is a problem.

Backwards compatibility, and a polyglot ecosystem, thanks to the number of compiler toolchains that support PTX.
"The software needs to be better" is (and was) an easy call to make for anyone paying attention. The problem is that "AMD just needs to do better" is not and will never be an implementable strategy. Engineering isn't just about money. It's also about the process of exploring all the edge cases.

"We recommend that AMD to fix their GEMM libraries’ heuristic model such that it picks the correct algorithm out of the box instead of wasting the end user’s time doing tuning on their end." is such a profoundly unhelpful thing to say unless you imagine AMD's engineers just sitting around wondering what to do all day.

AMD needs to make their drivers better, and they have. Shit just takes time.
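
For anyone unsure what "tuning on their end" means in that quote, here is a toy sketch of the idea in pure Python: time each candidate algorithm on your actual problem and keep the fastest. The two loop orderings and the `tune` helper are hypothetical stand-ins I made up; real GEMM libraries choose among GPU kernels, and this is not AMD's tuning tooling. A heuristic model is meant to predict the winner from the problem shape without running this search.

```python
import time


def matmul_ijk(a, b):
    """Textbook triple loop: one dot product per output element."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def matmul_ikj(a, b):
    """Same arithmetic, reordered so the inner loop walks rows of b and c."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        row_c = c[i]
        for p in range(k):
            a_ip = a[i][p]
            row_b = b[p]
            for j in range(m):
                row_c[j] += a_ip * row_b[j]
    return c


CANDIDATES = {"ijk": matmul_ijk, "ikj": matmul_ikj}


def _time_once(fn, a, b):
    start = time.perf_counter()
    fn(a, b)
    return time.perf_counter() - start


def tune(a, b, repeats=3):
    """Return the name of the fastest candidate for these inputs."""
    best_name, best_time = None, float("inf")
    for name, fn in CANDIDATES.items():
        # Take the best of several runs to reduce timing noise.
        elapsed = min(_time_once(fn, a, b) for _ in range(repeats))
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


a = [[float(i + j) for j in range(40)] for i in range(40)]
b = [[float(i - j) for j in range(40)] for i in range(40)]
print("fastest on this machine:", tune(a, b))
```

The winner depends on the machine and the matrix shape, which is exactly why the search lands on the user when the library's default pick is wrong.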

Sounds more like they were (and still are) being sloppy. “be better” is one thing. “runs without fatal crash” is what semi is talking about.
In buggy numerical code many bugs go through the software stack without any problems. No crash, no errors. For example, you might switch two double parameters to a function and if their value range is similar, everything works fine except it's all bullshit.
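
A minimal illustration of the swapped-parameter case, with a made-up function and values (nothing here comes from AMD's code):

```python
def scale_and_shift(x: float, scale: float, offset: float) -> float:
    """Return x * scale + offset."""
    return x * scale + offset


x, scale, offset = 2.0, 1.5, 1.25

correct = scale_and_shift(x, scale, offset)  # arguments in the declared order
swapped = scale_and_shift(x, offset, scale)  # bug: scale and offset switched

# Both calls succeed and both results look plausible; only a comparison
# against a known-good reference value exposes the mistake.
print(correct, swapped)  # prints 4.25 4.0
```

Calling with keyword arguments, as in `scale_and_shift(x, scale=scale, offset=offset)`, rules out this particular mistake, but the general point stands: nothing crashes, so only checking results catches it.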

If there are bugs in AMD code that prevent running tests, I bet there are even more bugs that don't manifest until you look at results.