by dogma1138 539 days ago
Anyone who looks at the mess that is ROCm and the design choices they made could easily see that.

GPU support lagged behind for years, no support for APUs and no guaranteed forward compatibility were clear signs that as a whole they have no idea what they are doing when it comes to building and shipping a software ecosystem.

To that you can add the long history of both AMD and ATI, even before they merged, of releasing dog shit software and then dropping support for it.

On the other hand you can take any CUDA binary even one that dates back to the original Tesla and run it on any modern NVIDIA GPU.

2 comments

> GPU support lagged behind for years, no support for APUs and no guaranteed forward compatibility were clear signs that as a whole they have no idea what they are doing when it comes to building and shipping a software ecosystem.

This is likely self-inflicted. They decided to make two different architectures: CDNA for HPC and RDNA for graphics. They are reportedly going to rectify this with UDNA in the future, but that is what they really should have done from the start. Nvidia builds one architecture, with different chips based on it to cover everything, and code written for one chip easily works on another because it is the same architecture. This is before even considering that they have PTX as an intermediate language, which serves a similar purpose to Java bytecode in allowing write once, run anywhere.

This was happening before CDNA was even a thing.

They didn’t release support even for all GPUs from the same generation, and they dropped support for GPUs sometimes within 6 months of releasing a version that actually “worked”.

The entire core architecture behind ROCm is rotten.

P.S. NVIDIA usually has multiple CUDA feature levels even within a generation. The difference is that a) they always provide a fallback option, which usually doesn’t require any manual intervention, and b) as long as you define the minimum target feature level when you build the binary, you are guaranteed to run on all existing hardware that supports the feature level you targeted and on all future hardware.
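To make the fallback rule concrete, here is a rough Python model of how a CUDA fat binary gets matched to a device, as I understand it from the CUDA programming guide: native code (SASS) built for compute capability X.y runs on X.z with z >= y, while embedded PTX can be JIT-compiled for any equal or newer compute capability. The names `Image`, `runnable` and `pick_image` are made up for illustration, and the model ignores architecture-specific targets (the "a"-suffixed ones), which do not carry the forward guarantee.

```python
from typing import List, NamedTuple, Optional, Tuple


class Image(NamedTuple):
    """One entry in a (hypothetical) fat binary."""
    kind: str              # "sass" = native machine code, "ptx" = virtual ISA
    cc: Tuple[int, int]    # compute capability it was built for, e.g. (7, 0)


def runnable(image: Image, device_cc: Tuple[int, int]) -> bool:
    """Simplified compatibility rule; an assumption-level model, not driver code."""
    if image.kind == "ptx":
        # PTX can be JIT-compiled for any equal or newer compute capability.
        return device_cc >= image.cc
    if image.kind == "sass":
        # Native code only runs within the same major version, minor >= build minor.
        return device_cc[0] == image.cc[0] and device_cc[1] >= image.cc[1]
    raise ValueError(f"unknown image kind: {image.kind}")


def pick_image(fat_binary: List[Image], device_cc: Tuple[int, int]) -> Optional[Image]:
    """Prefer native code when it fits, otherwise fall back to PTX; None if nothing fits."""
    candidates = [img for img in fat_binary if runnable(img, device_cc)]
    if not candidates:
        return None
    return max(candidates, key=lambda img: (img.kind == "sass", img.cc))


# A binary built years ago with native code and PTX for compute capability 7.0.
fat = [Image("sass", (7, 0)), Image("ptx", (7, 0))]
same_major = pick_image(fat, (7, 5))   # native code still fits
newer_gpu = pick_image(fat, (9, 0))    # falls back to JIT-compiling the PTX
older_gpu = pick_image(fat, (6, 1))    # below the minimum target, nothing fits
```

The point of the sketch is the middle case: a newer GPU that has no matching native code still gets something to run, provided the PTX was embedded at build time.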

The differences between CUDA feature levels appear minor according to the PTX documentation:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

They also appear to be cumulative.

It doesn’t matter; the point is that they don’t break stuff. You can still compile CUDA today to work on old hardware, and your binaries are guaranteed to have forward compatibility.

You don’t get that with ROCm, and this is why it’s garbage unless someone else abstracts all of that from you.

So if Microsoft is happy to maintain an ML-as-a-service solution that just takes prompts and maybe data, it’s not your problem.

But if you need to run your own workloads, which can include workloads that are well outside of “AI” and might not even be possible or remotely profitable to wrap in a SaaS, it’s all on you.

> On the other hand you can take any CUDA binary even one that dates back to the original Tesla and run it on any modern NVIDIA GPU

This particular difference stems from the fact that NVIDIA has PTX and AMD does not have any such thing. I.e., this kind of backwards compatibility will never be possible on AMD.

Backward compatibility is one thing, but not having forward compatibility is a killer.

Having to create a binary that targets a very specific set of hardware, with no guarantees, and in fact with a guarantee that it won’t run on future hardware, is what makes ROCm unusable for anything you intend to ship.
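As a rough illustration of what "targets a very specific set of hardware" means in practice, here is a deliberately simplified Python model. It assumes each code object is built for an exact gfx ISA target and that nothing runs unless the device's target was in the build list. The function name `uncovered_devices` is hypothetical, and the model ignores any workarounds or newer generic targets.

```python
from typing import Iterable, List, Set


def uncovered_devices(built_targets: Set[str], fleet: Iterable[str]) -> List[str]:
    """Return the device targets that have no matching code object in the build.

    Assumption for this sketch: matching is by exact target name only, with no
    intermediate representation to fall back on.
    """
    return sorted({target for target in fleet if target not in built_targets})


# Binary shipped with code objects for two data-center parts only.
shipped = {"gfx906", "gfx90a"}

# Customer machines, including a newer consumer GPU released after the build.
fleet = ["gfx90a", "gfx1100", "gfx906", "gfx1100"]

missing = uncovered_devices(shipped, fleet)
```

Under that assumption, every new target that shows up in `missing` means a rebuild and a re-release, which is exactly the shipping problem described above.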

What’s worse is that they also drop support for their GPUs faster than Leo drops support for his girlfriends once they reach 25…

So not only do you have to recompile, there is also no guarantee that your code will work with future versions of ROCm, or that future versions of ROCm will still produce binaries compatible with your older hardware.

Like how is this not the first design goal to address when you are building a CUDA competitor I don’t fucking know.

> Like how is this not the first design goal to address when you are building a CUDA competitor I don’t fucking know.

The words "tech debt" do not have any meaning at AMD. No one understands why this is a problem.

Backwards compatibility, and a polyglot ecosystem, thanks to the number of compiler toolchains that support PTX.