by torrance 1475 days ago
I’m not using Frontier, but I am using Setonix which is a large AMD cluster being rolled out in Australia. All of AMD’s teaching materials are about ROCm so this is very much how they’re expecting it to be used.

The real pain for us is that there are no decent consumer-grade chips with ROCm compatibility for us to do development on. AMD have made it very clear they only care about the data centre hardware when it comes to ROCm, but I have no idea what kind of developer workflow they’re expecting there.

7 comments

The ROCm stack will run on non-datacentre hardware in YMMV fashion. A lot of the LLVM ROCm development is done on consumer hardware; the ROCm stack just isn't officially tested on gaming cards during the release cycle. In my experience codegen is usually fine and the Linux driver is a bit version-sensitive.
I'm surprised you're not using HIP? At least in my experience it seems like HIP is the go-to system for programming the AMD GPUs, in large part because of CUDA compatibility. You can mostly get things to work with a one-line header change [1].

(I work for a DOE lab but views are my own, etc.)

[1] As an example, see the approach in: https://github.com/flatironinstitute/cufinufft/pull/116

HIP is just the programming language/runtime, ROCm is the whole software stack/platform.
Vega64 or Vega56 seems to work pretty well with ROCm in my experience.

Hopefully AMD gets the RX 6800 XT working with ROCm consistently, but even then, the 6800 XT is RDNA2, while the supercomputer MI250X is closer to the Vega64 in many ways.

So all in all, you probably want a Vega64, Radeon VII, or maybe an older MI50 for development purposes.

> Hopefully AMD gets the Rx 6800xt working with ROCm consistently

I am a maintainer for rocSOLVER (the ROCm LAPACK implementation) and I personally own an RX 6800 XT. It is very similar to the officially supported W6800. Are there any specific issues you're concerned about?

I know the software and I have the hardware. I'd be happy to help track down any issues.

That's good to hear.

I might be operating off of old news. But IIRC, the 6800 wasn't well supported when it first came out, and AMD has constantly been applying patches to get it up to speed.

I wasn't sure what the state of the 6800 was (I don't own one myself). As I said a bit earlier, I use the Vega64 with no issues for 256-thread workgroups. I do think there's some obscure bug with 1024-thread workgroups, but I haven't really been able to track it down. Sticking with 256 threads is better for my performance anyway, so I never really bothered trying to figure this one out.

Navi 21 launched in November 2020 but it only got official support with ROCm 5.0 in February 2022.

With respect to your issue running 1024 threads per block, if you're running out of VGPRs, you may want to try explicitly specifying the max threads per block as 1024 and see if that helps. I recall that at one point the compiler was defaulting to 256 despite the default being documented as 1024.
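To see why the thread limit the compiler assumes matters for VGPRs, here is a rough host-side sketch of the register budget arithmetic. The hardware figures (4 SIMDs per compute unit, 64-wide wavefronts, 256 VGPRs per lane per SIMD) are my assumption of GCN-class numbers such as Vega, not something stated above, and `CuModel` and `max_vgprs_per_thread` are hypothetical names. It also ignores register allocation granularity.

```cpp
// Assumed GCN-style hardware parameters (not from the thread); adjust for your GPU.
struct CuModel {
    unsigned simds_per_cu;         // SIMD units in one compute unit
    unsigned wavefront_size;       // threads per wavefront
    unsigned vgprs_per_simd_lane;  // VGPRs available per lane on one SIMD
};

// Upper bound on VGPRs each thread may use if one whole workgroup must fit
// on a single compute unit. Precondition: threads_per_block > 0.
inline unsigned max_vgprs_per_thread(CuModel hw, unsigned threads_per_block) {
    // Wavefronts needed for the workgroup, rounded up.
    unsigned waves = (threads_per_block + hw.wavefront_size - 1) / hw.wavefront_size;
    // The busiest SIMD hosts this many of those wavefronts, rounded up.
    unsigned waves_per_simd = (waves + hw.simds_per_cu - 1) / hw.simds_per_cu;
    // Wavefronts resident on the same SIMD split its register file.
    return hw.vgprs_per_simd_lane / waves_per_simd;
}
```

Under these assumptions, a kernel compiled for 256 threads per block may use up to 256 VGPRs per thread, while a 1024-thread block only fits if each thread stays within 64. In HIP and CUDA the limit is declared on the kernel with the `__launch_bounds__(1024)` attribute.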

The main issue I have with the idea of Navi 21 is that it's a 32-wide warp, when CDNA2 (like the MI250X) is a 64-wide warp.

Granted, RDNA and CDNA still have largely the same assembly language, so it's still better than using, say, NVIDIA GPUs. But I have to imagine that the 32-wide vs 64-wide difference is big in some use cases. In particular: low-level programs that use warp-level primitives, like DPP, shared-memory details and such.
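As a rough illustration of how the width difference can bite, here is a host-side C++ model of a shuffle-down tree reduction across one wavefront. It is only a sketch: `wavefront_reduce` is a hypothetical helper that imitates the usual pattern on the CPU, not real HIP code, and it only models the value that lane 0 ends up with.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of a shuffle-down tree reduction over one wavefront.
// `lanes` holds one value per lane (must be non-empty); `start_offset` is
// the first shuffle distance, normally half the wavefront width.
// Returns the value lane 0 holds at the end.
int wavefront_reduce(std::vector<int> lanes, std::size_t start_offset) {
    for (std::size_t offset = start_offset; offset > 0; offset /= 2) {
        // Snapshot first: on hardware all lanes read before any lane writes.
        const std::vector<int> prev = lanes;
        for (std::size_t lane = 0; lane + offset < lanes.size(); ++lane) {
            lanes[lane] = prev[lane] + prev[lane + offset];
        }
    }
    return lanes[0];
}
```

With 64 lanes each holding 1, `wavefront_reduce(lanes, 32)` gives 64, but a kernel written for 32-wide warps that hardcodes a starting offset of 16 gives 32: it silently sums only half of a 64-wide wavefront. Deriving the offset from the device's wavefront width, rather than a literal 32, avoids this class of bug.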

I assume the supercomputer programmers want a cheap system to have under their desk to prototype code that's similar to the big MI250X system. Vega56/64 is several generations old, while the 6800 XT is pretty different architecturally. It seems weird that they'd have to buy MI200 GPUs for this purpose, especially in light of NVIDIA's strategy, where an A2000 could serve as a close stand-in: maybe not perfect, but closer to the big-daddy A100 than the 6800 XT is to the big-daddy MI250X.

--------

EDIT: That being said, this is probably completely moot for my own purposes. I can't afford an MI250X system at all. At best I'd make some kind of hand-built consumer rig for my own personal purposes. So a 6800 XT would be all I personally need. VRAM constraints feel quite real, so 16 GB of VRAM at that price makes the 6800 XT a very pragmatic system for personal use and study.

The Radeon VII was a great choice for that while it was on sale. I'm going to be quite sad when mine die.
Interesting. So what is your workflow right now?
Develop against CUDA locally. Port my kernels to ROCm, and occupy a whole HPC node for debugging and performance tuning for a week. It’s terrible.

Edit: I should say that their recommendation is to write the kernels in HIP, which is supposed to be their cross-device wrapper for both CUDA and ROCm. I’m writing in Julia, however, so that’s not possible.

The AMD software stack has been behind for a long time but I feel like we're finally catching up. I heard that HIP (and hopefully the rest of ROCm) is now supported on the RX 6800 XT consumer GPU... maybe that could help? BTW my team at AMD has been using Julia for ML workloads for a while. We should get in touch - maybe some of the lessons we learn can be useful to you. My email is claforte. The domain I'm sure you can guess. ;-)
BTW have you tried `KernelAbstractions.jl`? With it you can write code once that will run reasonably fast on AMD or NVIDIA GPUs or even on CPU. One of our engineers just started using it and is pleased with it - apparently the performance is nearly equivalent to native CUDA.jl or AMDGPU.jl, and the code is simpler.
If you are using Julia I would recommend looking at AMDGPU.jl and (plugging my own project here) KernelAbstractions.jl.
Can you write SYCL code and compile it to ROCm for production?