Stable Diffusion on AMD RDNA3 (nod.ai)
158 points by tomtomlapomme 1277 days ago
8 comments

> SHARK is an open source cross platform (Windows, macOS and Linux) Machine Learning Distribution packaged with torch-mlir (for seamless PyTorch integration), LLVM/MLIR for re-targetable compiler technologies along with IREE (for efficient codegen, compilation and runtime) and Nod.ai’s tuning. IREE is part of the OpenXLA Project

Google has been doing a good job advancing the IREE ML compiler project, which I think is what will bring other hardware platforms like AMD and Intel into the ML game. The industry can only benefit from increased hardware portability.

Does anyone know the current state of AMD's tools for migrating from CUDA? There's so much untapped potential with these cards, it's crazy that basically only gamers can make use of their competitive prices.
Last time I seriously checked (6 months ago or so) ROCm was still a far cry from CUDA. Setup was a mess, support was hit and miss, and some operations were not particularly performant compared to their CUDA counterparts. Additionally, there are TensorFlow and probably PyTorch forks that should work with it, but they lag behind the official repositories quite a bit.

I hope that now that generative AI is becoming mainstream, AMD steps up their game on both their consumer and professional lineups. If I were to buy a video card right now (mostly for gaming, ML hobby projects, and running Stable Diffusion) I wouldn't pick AMD, because I could do just 1/3 of my use cases properly without headaches (gaming).

OpenCL works pretty well. Can't say I notice large performance gaps between CUDA and OpenCL for my HPC work.
Thankfully for a good chunk of number crunching that works fine. But the other side of the coin is notably AI workloads. There's no OpenCL or Vulkan standard for exposing matrix units, only vendor specific ones.

For OpenCL: cl_qcom_ml_ops (Qualcomm) notably, for Vulkan: VK_NV_cooperative_matrix (NVIDIA)

Have you done any benchmarks with Vulkan?
No, I haven't used Vulkan for compute.
I don’t think there’s truly a competitor, but OpenCL is the alternative to shoot for. Otherwise, for machine learning purposes, AMD helps develop ROCm.
OpenCL is hardly an alternative: plain old C, compiled from source at runtime, with very basic tooling available.

Versus a polyglot compiler infrastructure, IDE tooling that includes shader debugging, and a rich ecosystem of GPU-based libraries.

Even with SYCL and SPIR-V, that has hardly improved, and while Intel bases oneAPI on top of SYCL, that naturally also goes beyond the standard.

Do you have an opinion on the new OpenCL implementation that recently got merged into Mesa? It doesn't touch on tooling or the other points you mentioned, but performance seems to be pretty good!

https://www.phoronix.com/news/Rusticl-2022-XDC-State

It's still a heavy work in progress. Not usable for SYCL programs as SVM isn't implemented yet.
No, it doesn't seem to matter for what makes CUDA relevant anyway.
What do you mean by a polyglot compiler infrastructure? Are you referring to the fact that CUDA source is single-file (your host and device code are in the same compilation unit)? Or do you mean that you can ship the same binary to different GPU architectures?

SYCL solves the first issue, and SPIR-V solves the second one. (OpenCL mostly avoids the issue in general though by making you ship source which is then compiled by the driver, but SPIR-V allows you to ship a 'binary' instead).

No clue as for debugging and IDE tooling, but I did find a rocgdb binary on my Linux ROCm installation (which is for HIP, not SYCL). No clue what oneAPI offers for debugging.

Furthermore, Clang (and hence clangd) speaks HIP and I think SYCL too. So the non-runtime IDE tooling should work.

Finally, a lot of GPU libraries are I think available for ROCm/HIP too. It's unfortunate that the HIP stack sucks enormously in other ways.

Shouldn't we have an API that can speak to both CUDA and opencl? Or is opencl sufficiently capable?
I understand AMD HIP is a CUDA clone, where library functions have the same syntax but with hip replacing cuda in the function names.

Behind the scenes, it can use AMD and NVIDIA hardware alike. Thus, the idea is that through typically negligible effort porting to HIP, your code becomes vendor-independent.

In practice, I do not know how true this is.

> Thus, the idea is that through typically negligible effort porting to HIP, your code becomes vendor-independent.

Here, the big AMD mistake was to rename those function prefixes in the first place. It's a mistake that they could have avoided...

What a lot of software codebases did to support AMD (notably PyTorch): the codebase stays CUDA, and a conversion pass to HIP runs at build time.

See https://github.com/ROCm-Developer-Tools/HIPIFY/blob/amd-stag... for the Perl script to do it.
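To make the "conversion pass" idea concrete, here is a toy sketch of a textual CUDA-to-HIP rename. It is not the HIPIFY script linked above: `toy_hipify` and the small `CUDA_TO_HIP` table are hypothetical names made up for illustration, and the table covers only a handful of runtime calls, whereas the real tools carry a far larger mapping and also deal with headers and library calls.

```python
import re

# Illustrative subset only (an assumption for this sketch, not the real table).
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaError_t": "hipError_t",
    "cudaSuccess": "hipSuccess",
}

# Match whole identifiers only, so names that merely contain a CUDA
# identifier (e.g. my_cudaMalloc_wrapper) are left untouched.
_names = sorted(CUDA_TO_HIP, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(n) for n in _names) + r")\b")


def toy_hipify(source: str) -> str:
    """Return `source` with the known CUDA identifiers renamed to HIP ones."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)


if __name__ == "__main__":
    cuda_line = "cudaError_t e = cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);"
    print(toy_hipify(cuda_line))
```

Running such a pass at build time is what lets a project keep a single CUDA source tree and still produce a HIP build.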

Then comes the problem of AMD not supporting ROCm HIP on most of their hardware or user base.

On Windows, the ROCm HIP SDK is private and only available under NDA. This means that while you can use Blender w/ HIP on Windows, the Blender builds that you compile yourself will not be able to use ROCm HIP.

On Linux, the supported GPUs are few and far between: Vega20 onwards are supported today. APUs, RDNA1, and lower-end RDNA2 (6700 XT and below) are excluded unless you resort to unsupported hacks.

No, it isn't, because it lacks the polyglot infrastructure of CUDA. It now has SPIR-V, but hardly anyone targets it as PTX gets used.
Performance in Blender is atrocious: the RX 7900 XTX is noticeably slower than an RTX 3060.
A big part of the reason is that Blender on Nvidia supports hardware accelerated ray tracing using OptiX. HIP-RT exists, but is not used in Blender yet. I think the Intel oneAPI backend for Arc GPUs also misses RT acceleration.

AMD claims to have HIP-RT working internally, but not yet suitable for posting publicly. Intel is planning it, I think. Both should land around Blender 3.6, if I'm not mistaken.

If you take the raw FLOPS, CUDA (not OptiX) and HIP are actually nearly equivalent in performance last I remember. I think RDNA2 just does "more with less", at least in terms of gaming performance per FLOP (e.g. due to the huge cache).

Latest Blender release does not have the optimization work in yet.

AIUI, what's in current git master is very different.

I really wish more GPU libraries had focused on vulkan instead of CUDA ...
It's one of the reasons Nvidia is basically untouchable at this time. The AI field willingly enslaved itself to NVidia.
It's because NVIDIA actually cared and AMD does not where it matters (customer HW).

Openness is totally secondary to functionality. You know, the same kind of reason why Linux on the desktop has not been a mass-market thing compared to Windows for a very long time.

OpenCL is totally functional, on AMD and even iGPU Intel. The reason Nvidia won was because they made it easier. And the AI people ate it up. The tooling Nvidia offers is second to none. But you can build almost anything CUDA does with OpenCL. It's simply harder to do.

The AI crowd cared more about that than the impact of tying the entire ecosystem to a single company.

Who knows what OpenCL might have been if it had been the premier implementation language. I'd wager it would have gotten a LOT more love.

No. It's not.

AMD only officially supports GPGPU headless. That discounts 90% of the market. Old graphics cards lose support randomly. That discounts much of the rest. The whole thing is a horrible, bug-ridden mess.

I'd pick AMD over NVidia if it was e.g. 50% slower at the same price point -- open source is worth waiting for -- but I can't take nonworking.

AMD also has no support. I'm now building tooling reliant on NVidia, so if AMD ever gets their stuff working, we're many backports away from a working ecosystem. The longer AMD takes, the deeper the hole.

> OpenCL is totally functional, on AMD and even iGPU Intel.

The fact that it’s not functional on non-NVIDIA platforms is the whole reason Blender dropped OpenCL support. If you’re going to write a bunch of implementation-specific code to handle AMD’s bugs/non-compliant runtime, why not just target CUDA directly?

“it’s open-source, if you want it fixed then pay someone to do it or just spend a month making a patch instead of doing your work???”

bugs-bunny-no.gif

Like, same with ROCm: AMD just wants to externalize all their costs onto customers, yet still wants them to adopt it instead of the turnkey solution that everyone else already uses. Why would anyone do that? It’s great for NVIDIA but terrible for users.

(and as for the “AMD drivers have been good for like ten years now!!!” crowd… counterpoint: the entire 5700XT thing, drivers broken for the first 18 months, just like Vega before it. And oh look, the 7900XTX is turning into a trainwreck too. There are just constant showstopping bugs with AMD drivers. Just like with ROCm too… patchwork support and endless bugs that don’t exist in the industry-standard solution. Nobody wants to spend their time doing AMD’s job for them.)

To their credit this is one thing Intel got right… they probably spent more dev time on oneAPI in the last year than AMD spent on ROCm and all their previous attempts/projects/resume-driven-development fodder combined.

When I catch myself writing a sentence like "It's simply harder to do" in order to promote or justify an alternative engineering approach, I try to, well, catch myself before making a really weak argument in favor of an inferior solution. Making life easy for developers is important.
I'm not claiming it's not important, and it's not very nice of you to say I did.

When you have basically doomed humanity to rely on a single (malicious) company for a technology as important as AI, then maybe, just maybe, the trade-off that it is harder to implement is worth it.

CUDA predates Vulkan by over 8 years.

There's a lot of established ecosystem for CUDA, thanks to Nvidia's investment.

I thought Vulkan was a graphics specific layer and CUDA was specifically for machine learning?
CUDA is general purpose compute, but NVIDIA also releases cuDNN, which all the major libraries use because it is fast and good (if a little complex). There are efforts underway to have a comparable library on open source general compute packages, but none as mature or effective as cuDNN, so people just pay NVIDIA to use that in practice, which lets them invest even more in pulling ahead.

As an aside, I’ve been kinda surprised that this has existed for as long as it has, but I am probably biased and think ML acceleration is more important than most large businesses do today.

Vulkan is designed for all GPU needs, from rendering to general purpose compute.

CUDA is only for general purpose compute.

CUDA is for GPGPU (general purpose GPU) which includes machine learning.

Vulkan is primarily for graphics but does have options for GPGPU too. Vulkan is however not like OpenGL in that it's fairly close to the hardware in terms of abstraction.

Vulkan has an atrocious developer experience by GPGPU standards.

The chance of it winning over CUDA is at zero. And that's _before_ considering its API gaps compared to modern OpenCL.

(Yes, even OpenCL is a much better compute API choice than Vulkan. Vulkan does not even have SVM)

You should think of Vulkan more as an IR endpoint than as the actually usable API here.

Vulkan is well supported by most GPUs because it's so low level. Performance tends to be good everywhere.

What would make Vulkan successful is having APIs that "compile" to this IR. Stuff like Vulkan Kompute is a good idea in this direction.

Vulkan is not a suitable API for even implementing Khronos's very own SYCL on top of. SYCL requires shared virtual memory capabilities that Vulkan just doesn't have.
Vulkan is fairly atrocious, GPGPU or not IMO. Obviously it tries to do something intrinsically complex but I've never enjoyed working with it.
Does CUDA have SVM either? Seems like a pretty niche feature IMHO
Yes, CUDA does, since Kepler, under the name CUDA Unified Memory.

It's not a niche feature at all, but one that is essential to lower the barrier for developer adoption.

“There has also been a wide variety of accuracy-degrading performance optimizations like Xformers and Flash Attention, which are great tools if you are open to trading accuracy for performance”

This is incorrect. Those optimizations do identical computations, but use memory bandwidth on the GPU more effectively. So there is no accuracy tradeoff there.
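The claim can be illustrated with a toy, pure-Python sketch of the tiling idea: attention for a single query computed once with all scores held in memory, and once block by block using a running max and running normalizer (the "online softmax" rescaling). This is an assumption-laden illustration, not the actual FlashAttention or xformers kernel: the function names are made up, the usual 1/sqrt(d) scaling is omitted, and the two results agree up to floating point rounding rather than bit for bit.

```python
import math


def naive_attention(q, keys, values):
    """Reference: softmax(q . k_i)-weighted sum of values, all scores at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same math, but keys/values are visited one block at a time.

    Only a running max, a running normalizer and a running output vector
    are kept between blocks.
    """
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0]
    keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.7], [2.0, -1.0, 0.0],
            [0.0, 1.0, 1.0], [-0.5, 0.5, 0.25]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(naive_attention(q, keys, values))
    print(tiled_attention(q, keys, values))
```

The tiled version never materializes the full score vector, which is where the memory savings come from; the softmax it computes is the same one.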

Here is a list of potential issues: https://github.com/AUTOMATIC1111/stable-diffusion-webui/disc...

That said, we (Nod.ai team) will add support for xformers soon so you can opt in for xformers anyway.

Any chance of getting SD running on a mobile Ryzen APU, e.g. Ryzen Pro 4750U (Renoir)?
Short answer: no. Long answer: "in theory" yes. I tried this [1] but gave up, as building ROCm + deps takes up to 6h :/ Official statement: [2]

[1] https://github.com/xuhuisheng/rocm-build

[2] https://github.com/RadeonOpenCompute/ROCm/issues/1587

For anyone on Arch, there is a third-party repository called arch4edu[0] that provides up to date builds of ROCm and its dependencies. On my iGPU, OpenCL sometimes works, sometimes crashes. Even finding a list of supported hardware is close to impossible. The whole situation is just ridiculous and makes AMD look bad.

[0] https://github.com/arch4edu/arch4edu

AMD doesn't actually care.

For them, GPGPU is a pro level feature not worth supporting on most customer GPUs. They are doing much more feature segmentation than NVIDIA ever did.

Can you give SHARK a try and let us know on our discord? We can try to help. People have been using it on older AMD GPUs back to Polaris arch.
nod-ai/SHARK from the original submission is by far the fastest way I've found to run Stable Diffusion on a 5700 XT.

For 50 iterations:

* ONNX on Windows was 4-5 minutes

* ROCm on Arch Linux was ~2.5 minutes

* SHARK on Windows is ~30 seconds
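For a rough sense of scale, the numbers above can be turned into per-iteration times and speedups. This is only arithmetic on the approximate wall-clock figures quoted in the list (the "~" values are taken at face value and ONNX is kept as a 4-5 minute range); the dictionary and helper names are hypothetical.

```python
ITERATIONS = 50

# (low, high) wall-clock seconds as reported above; values are approximate.
reported_seconds = {
    "ONNX on Windows": (240, 300),
    "ROCm on Arch Linux": (150, 150),
    "SHARK on Windows": (30, 30),
}


def seconds_per_iteration(low, high, iterations=ITERATIONS):
    """Convert a total-time range into a seconds-per-iteration range."""
    return low / iterations, high / iterations


def speedup_over(baseline, target="SHARK on Windows"):
    """Return the (min, max) speedup of `target` relative to `baseline`."""
    b_low, b_high = reported_seconds[baseline]
    t_low, t_high = reported_seconds[target]
    return b_low / t_high, b_high / t_low


if __name__ == "__main__":
    for name, (low, high) in reported_seconds.items():
        lo_it, hi_it = seconds_per_iteration(low, high)
        print(f"{name}: {lo_it:.1f}-{hi_it:.1f} s/iteration")
    print("vs ROCm:", speedup_over("ROCm on Arch Linux"))
    print("vs ONNX:", speedup_over("ONNX on Windows"))
```

On these figures SHARK comes out around 5x faster than the ROCm run and roughly 8-10x faster than the ONNX run, on this one card and setup.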

I’ll have to give this a go over the weekend!
> There has also been a wide variety of accuracy-degrading performance optimizations like Xformers and Flash Attention, which are great tools if you are open to trading accuracy for performance ..

I wasn't aware that Flash Attention trades accuracy for performance. Either I have a wrong understanding of what FA is, or this statement is not fully accurate.

Either way - looks like great work

From the flash attention paper:

> We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method.

So I assume they are using the approximate version as they also have an exact version.

Thanks for that - I missed the block-sparse extension of the algorithm when I first read about it. And indeed this seems to be what the author means.
Can someone explain what exactly nod.ai does? It's not clear at all from their page.
Can anyone point me to some examples of what I, as a techie, might want to actually use AI for? Some simple hobby projects?