HN Mirror



	by Tpt 457 days ago
	If I understand correctly, this library provides some Torch kernels customized for AMD hardware. Why haven't they just upstreamed them to PyTorch for better adoption? Also, they seem to demo usage with Torch's default eager execution mode rather than Torch JIT/TorchScript. Is this library compatible with TorchScript?


microtonal 457 days ago

I think a lot of stuff will get upstreamed eventually. PyTorch just moves slower and since it’s a stable library, I think it cannot rapidly adopt something like fused MoE until the dust has settled a little and it’s clear what the API would look like long-term.

I think it’s ok that stuff is tried first in Torch extensions. That’s how Flash Attention started after all and the same is true for newer kernels in CUDA-land (fused MoE, MLA, Marlin, etc.).

With regards to TorchScript, that’s really legacy - torch.compile is where it’s at. This post seems to suggest that the kernels work with torch.compile: https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR...

barrenko 457 days ago

I really don't understand why they can't just work with the existing OSS developers who are pulling their hair out trying to make AMD devices work, instead of doing it this way. It's like Mozilla and its questionable decisions.

roenxi 457 days ago

There are a lot of OSS developers; I doubt AMD has the resources to do that. And realistically they don't need to. I wandered over to watch some George Hotz videos the other day, and it looked like the AMD driver situation has improved to the point where specialist AMD access isn't needed for debugging anymore. That is a huge change, and very exciting for me personally, because it means I might be able to jump back to an AMD card and ditch the mess that is Nvidia on Linux.

In theory they might not even need to be involved in optimising compute kernels; there is probably some PhD student who'll do the work because they want to become a kernel-optimisation specialist. In practice, a few strategic applications of paid talent are all they really need. Everyone wants to diversify away from Nvidia, so there is a lot of interest in supporting AMD if they are willing to push out firmware that multiplies matrices without crashing, which has been a weird sticking point for AMD for a surprisingly long time.

impossiblefork 457 days ago

There's only one PyTorch, though, and it's what people use for ML nowadays.

Back in the day you had to optimize your card for Quake and do everything to make it run well. Now you have to do that for PyTorch.

roenxi 456 days ago

> Back in the day you had to optimize your card for Quake...

That is exactly the attitude that left AMD out in the cold during the AI revolution: they learned a lot of stupid lessons about optimising for specific games and present-day use cases instead of implementing general capabilities to a higher standard, as Nvidia did with CUDA. They ended up a decade away from a multi-trillion-dollar market.

PyTorch might be special. I wouldn't be at all surprised if AMD does have a dedicated engineer working on PyTorch. But their problem to date hasn't been their engagement with PyTorch; it's been that literally nobody could make PyTorch work on AMD cards, which had buggy and terrible support for GPGPU work. If they fixed that, some random person might do the work without their involvement, because a lot of people want to see that happen.

impossiblefork 456 days ago

Now that the required task is known, though, it doesn't really matter. If AMD understands that, they should have no problem putting engineers on making PyTorch work well.

Considering its importance, it shouldn't be one engineer. It should be 50+.

fock 457 days ago

I think they've been taken over by exactly the same people leading the AI hype. Funny how in this article they are a) not clearly advertising what they are doing, b) solving a small subset of problems in a way no one asked for (I think most people just want ROCm to work at all...), and c) just adding to a complex product without any consideration for actually integrating with its environment.

I guess it's vibecoding "AI"...

microtonal 457 days ago

> solving a small subset of problems in a way no one asked for

What do you mean? Having ROCm fused MoE and MLA kernels as a counterpart to kernels for CUDA is very useful. AMD needs to provide this if they want to keep AMD accelerators competitive with new models.

fock 457 days ago

Shouldn't the matrix multiplication at the core of this be in a core library? Why are generic layers intermixed with LLM-specific kernels when those generic layers duplicate functionality already in Torch?

Upstreaming that might actually help researchers doing new stuff, rather than just the narrow demographic of people speeding up LLMs on MI300Xs.

imtringued 457 days ago

They are imitating Nvidia's TensorRT with AITER. Basically AMD wants to have "CUDA, but not CUDA".

tdullien 457 days ago

They'd like to have CUDA, period, but are legally barred from it.

almostgotcaught 457 days ago

> They are imitating Nvidia's TensorRT

Do you know what the RT in TensorRT stands for? hint: AITER has nothing to do with TensorRT.

fc417fc802 457 days ago

> I think most people just want ROCm to work at all

I think most people don't want to have to think about vendor lock-in related bullshit. Most people just want their model to run on whatever hardware they happen to have available, don't want to have to worry about whether or not future hardware purchases will be compatible, and don't want to have to rewrite everything in a different framework.

Most people fundamentally don't care about ROCm or CUDA or OneAPI or whatever else beyond a means to an end.

hoomanmo 457 days ago

Which of Mozilla's questionable decisions are you referring to?

kouteiheika 457 days ago

> Why haven't they just upstreamed them to PyTorch for better adoption?

They don't seem to care, or don't understand how to get broader adoption.

For some reason AMD's management is dead set on targeting only the high-end part of the market. Look at this blog post, for example: which model are they testing? DeepSeek R1, the 671B behemoth that no normal person can run. Or look at any of their tutorials/docs and see which GPUs they support: it's always either the unobtanium-grade enterprise GPUs or high-end workstation cards that no one buys. And if your strategy is to target only super-rich entities, then a little jank in the software isn't all that punishing. If you can afford to drop a few million on GPUs, you can also afford to hire someone to spend a few weeks getting AMD's software to work, tuning it by tweaking the two dozen environment variables they seem to like so much, etc.

saagarjha 457 days ago

> For some reason AMD's management is dead set on targeting only the high end part of the market.

Because those people are dropping $100 billion on GPU clusters and individuals are not.

impossiblefork 457 days ago

Yes, but researchers use PyTorch, and those researchers end up being the end users of the GPU clusters.

NVIDIA GPUs sell so well because they work with what researchers actually use.

saagarjha 456 days ago

Oh I definitely think they should upstream to PyTorch, I'm just saying doing the usual "why doesn't AMD think of the gamers^W^W^W^W^W local model users" is not going to sway their policies.

imtringued 457 days ago

That would make the kernels the PyTorch Foundation's problem, and they would have to set up CI infrastructure around AMD GPUs to maintain them. For whatever reason, AMD really wants to keep everything in-house, even though that has been a losing strategy so far.