
	by ryao 539 days ago
	> GPU support lagged behind for years, no support for APUs and no guaranteed forward compatibility were clear signs that as a whole they have no idea what they are doing when it comes to building and shipping a software ecosystem.

	This is likely self-inflicted. They decided to make two different architectures: CDNA for HPC and RDNA for graphics. They are reportedly going to rectify this with UDNA in the future, but that is what they should have done from the start.

	Nvidia builds one architecture, with different chips based on it to accommodate everything, so code written for one chip easily works on another. That is before even considering PTX, an intermediate language that serves a purpose similar to Java bytecode in allowing write once, run anywhere.


dogma1138 539 days ago

This was happening before CDNA was even a thing.

They didn’t even release support for all GPUs from the same generation, and they dropped support for GPUs sometime within 6 months of releasing a version that actually “worked”.

The entire core architecture behind ROCm is rotten.

P.S. NVIDIA usually has multiple CUDA feature levels even within a generation. The difference is that (a) they always provide a fallback option, which usually doesn’t require any manual intervention, and (b) as long as you define the minimum target feature level when you build the binary, it is guaranteed to run on all past hardware that supports the feature level you targeted and on all future hardware.
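
The fallback behaviour described above can be sketched as a toy model. This is not NVIDIA's actual loader logic, only a simplified illustration of the documented compatibility rules: native code (SASS) built for compute capability X.y runs on devices with the same major version and an equal or higher minor version, while embedded PTX can be JIT-compiled for any device at or above its target. The names `Arch`, `Image`, `can_run`, and `pick_image` are hypothetical, and special cases such as architecture-specific targets are ignored.

```python
from typing import List, NamedTuple, Optional


class Arch(NamedTuple):
    """A compute capability such as 7.0 or 8.6 (hypothetical helper type)."""
    major: int
    minor: int


class Image(NamedTuple):
    """One entry in a fat binary (hypothetical helper type)."""
    kind: str   # "sass" = native machine code, "ptx" = intermediate code
    arch: Arch  # compute capability the image was built for


def can_run(image: Image, device: Arch) -> bool:
    """Simplified compatibility rule; an assumption-laden model, not driver code."""
    if image.kind == "sass":
        # Native code: same major version, device minor >= image minor.
        return image.arch.major == device.major and image.arch.minor <= device.minor
    if image.kind == "ptx":
        # PTX: can be JIT-compiled for any device at or above its target.
        return image.arch <= device
    raise ValueError(f"unknown image kind: {image.kind!r}")


def pick_image(images: List[Image], device: Arch) -> Optional[Image]:
    """Prefer usable native code, then fall back to PTX; newest target wins."""
    usable = [img for img in images if can_run(img, device)]
    if not usable:
        return None
    return max(usable, key=lambda img: (img.kind == "sass", img.arch))


# A binary built with native code for 7.0 plus PTX for 7.0 as the fallback.
fat_binary = [Image("sass", Arch(7, 0)), Image("ptx", Arch(7, 0))]

same_generation = pick_image(fat_binary, Arch(7, 5))   # native code is reused
future_hardware = pick_image(fat_binary, Arch(9, 0))   # falls back to PTX
older_hardware = pick_image(fat_binary, Arch(6, 1))    # below the minimum target
```

In this model, the binary keeps working on newer devices because the PTX entry is always eligible there, which is the "no manual intervention" fallback. Only hardware below the minimum target is left without a usable image.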


ryao 539 days ago

The differences between CUDA feature levels appear minor according to the PTX documentation:

https://docs.nvidia.com/cuda/parallel-thread-execution/index...

They also appear to be cumulative.


dogma1138 538 days ago

It doesn’t matter; the point is that they don’t break stuff. You can still compile CUDA today to work on old hardware, and your binaries are guaranteed to have forward compatibility.

You don’t get that with ROCm, and this is why it’s garbage unless someone else abstracts all of that away for you.

So if Microsoft is happy to maintain an ML-as-a-service solution that just takes prompts and maybe data, it’s not your problem.

But if you need to run your own workloads, which can include workloads that are well outside of “AI” and that might not be possible or remotely profitable to wrap in a SaaS offering, it’s all on you.
