by uniqueuid 1478 days ago
Since they are using AMD's accelerators as well [1], I do wonder whether any usage of these will trickle down and give us improvements in ROCm.

Surely the people at these labs will want to run ordinary DL frameworks at some point - or do they have the money and time to always build entirely custom stacks?

[1] AMD Instinct MI250x in this case.


I’m not using Frontier, but I am using Setonix which is a large AMD cluster being rolled out in Australia. All of AMD’s teaching materials are about ROCm so this is very much how they’re expecting it to be used.

The real pain for us is that there are no decent consumer-grade chips with ROCm compatibility for us to do development on. AMD have made it very clear they only care about the data centre hardware when it comes to ROCm, but I have no idea what kind of developer workflow they’re expecting there.

The rocm stack will run on non-datacentre hardware in YMMV fashion. A lot of the llvm rocm development is done on consumer hardware, the rocm stack just isn't officially tested on gaming cards during the release cycle. In my experience codegen is usually fine and the Linux driver a bit version sensitive.
I'm surprised you're not using HIP? At least in my experience it seems like HIP is the go-to system for programming the AMD GPUs, in large part because of CUDA compatibility. You can mostly get things to work with a one-line header change [1].

(I work for a DOE lab but views are my own, etc.)

[1] As an example, see the approach in: https://github.com/flatironinstitute/cufinufft/pull/116
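
For anyone who hasn't seen the header trick, here is a minimal sketch of what such a compatibility shim can look like. It is not the code from the linked PR. `USE_HIP` and `USE_CUDA` are hypothetical build flags, `roundtrip` is a made-up helper, and the host-only branch exists purely so the sketch compiles without a GPU toolkit. The HIP names on the right-hand side are the runtime's counterparts of the CUDA calls.

```c
#include <assert.h>
#include <stddef.h>

#if defined(USE_HIP)            /* hypothetical flag: build for AMD */
#include <hip/hip_runtime_api.h>
#define cudaError_t            hipError_t
#define cudaSuccess            hipSuccess
#define cudaMalloc             hipMalloc
#define cudaFree               hipFree
#define cudaMemcpy             hipMemcpy
#define cudaMemcpyHostToDevice hipMemcpyHostToDevice
#define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
#elif defined(USE_CUDA)         /* hypothetical flag: build for NVIDIA */
#include <cuda_runtime_api.h>
#else
/* Host-only stand-ins (an assumption of this sketch, not a real API). */
#include <stdlib.h>
#include <string.h>
typedef int cudaError_t;
enum { cudaSuccess = 0 };
enum { cudaMemcpyHostToDevice = 1, cudaMemcpyDeviceToHost = 2 };
static cudaError_t cudaMalloc(void **p, size_t n)
{
    *p = malloc(n);
    return *p ? cudaSuccess : 1;
}
static cudaError_t cudaFree(void *p) { free(p); return cudaSuccess; }
static cudaError_t cudaMemcpy(void *dst, const void *src, size_t n, int kind)
{
    (void)kind;                 /* direction is irrelevant on the host */
    memcpy(dst, src, n);
    return cudaSuccess;
}
#endif

/* Application code keeps its CUDA spellings in every build.
   Copies count floats to the device and back; returns 0 on success. */
int roundtrip(const float *in, float *out, size_t count)
{
    assert(in != NULL && out != NULL);
    void *dev = NULL;
    size_t bytes = count * sizeof(float);
    if (cudaMalloc(&dev, bytes) != cudaSuccess)
        return -1;
    int ok = cudaMemcpy(dev, in, bytes, cudaMemcpyHostToDevice) == cudaSuccess
          && cudaMemcpy(out, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    return ok ? 0 : -1;
}
```

In a real project the `#if` block would live in its own header, so switching back ends means changing which header is included.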

HIP is just the programming language/runtime, ROCm is the whole software stack/platform.
Vega64 or Vega56 seems to work pretty well with ROCm in my experience.

Hopefully AMD gets the RX 6800 XT working with ROCm consistently, but even then, the 6800 XT is RDNA2, while the supercomputer MI250X is closer to the Vega64 in more ways.

So all in all, you probably want a Vega64, Radeon VII, or maybe an older MI50 for development purposes.

> Hopefully AMD gets the Rx 6800xt working with ROCm consistently

I am a maintainer for rocSOLVER (the ROCm LAPACK implementation) and I personally own an RX 6800 XT. It is very similar to the officially supported W6800. Are there any specific issues you're concerned about?

I know the software and I have the hardware. I'd be happy to help track down any issues.

That's good to hear.

I might be operating off of old news. But IIRC, the 6800 wasn't well supported when it first came out, and AMD constantly has been applying patches to get it up-to-speed.

I wasn't sure what the state of the 6800 was (I don't own it myself), so I might be operating under old news. As I said a bit earlier, I use the Vega64 with no issues for 256-thread workgroups. I do think there's some obscure bug for 1024-thread workgroups, but I haven't really been able to track it down. Sticking with 256 threads is better for my performance anyway, so I never really bothered trying to figure this one out.

Navi 21 launched in November 2020 but it only got official support with ROCm 5.0 in February 2022.

With respect to your issue running 1024 threads per block, if you're running out of VGPRs, you may want to try explicitly specifying the max threads per block as 1024 and see if that helps. I recall that at one point the compiler was defaulting to 256 despite the default being documented as 1024.

The main issue I have with the idea of Navi 21 is that it's a 32-wide warp, when CDNA2 (like the MI250X) is a 64-wide warp.

Granted, RDNA and CDNA still have largely the same assembly language, so it's still better than using, say, NVidia GPUs. But I have to imagine that the 32-wide vs 64-wide difference is big in some use cases. In particular: low-level programs that use warp-level primitives, like DPP, shared-memory details and such.
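
To illustrate why the width matters for warp-level code, here is a host-side simulation (plain C, no GPU needed) of the usual shuffle-down tree reduction. It is only a sketch: `wave_reduce_lane0` is a made-up function, and on real hardware out-of-range lanes behave slightly differently than modelled here, although lane 0 ends up with the same value. The point is that a starting stride of 16, which is correct for 32 lanes, silently sums only half of a 64-lane wavefront.

```c
#include <assert.h>

/* Simulates lane 0 of a shuffle-down tree reduction over one wavefront.
   vals holds one value per lane and is overwritten. first_offset is the
   starting stride; it must be width / 2 to cover every lane. */
double wave_reduce_lane0(double *vals, int width, int first_offset)
{
    assert(width > 0 && (width & (width - 1)) == 0);  /* power of two */
    for (int offset = first_offset; offset > 0; offset /= 2) {
        /* Ascending order reads vals[lane + offset] before that slot is
           updated, which mimics all lanes exchanging values in lockstep. */
        for (int lane = 0; lane + offset < width; ++lane)
            vals[lane] += vals[lane + offset];
    }
    return vals[0];
}
```

With 64 lanes each holding 1.0, a first stride of 32 gives 64 in lane 0, while a hard-coded 16 gives 32. In HIP the usual way around this is to derive the stride from the built-in `warpSize` instead of a literal.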

I assume the super-computer programmers want a cheap system to have under their desk to prototype code that's similar to the big MI250x system. Vega56/64 is several generations old, while the 6800 XT is pretty different architecturally. It seems weird that they'd have to buy MI200 GPUs for this purpose, especially in light of NVidia's strategy (where an NVidia A2000 could serve as a close replacement. Maybe not perfect, but closer to the A100 big daddy than the 6800 XT is to the big daddy MI250x).

--------

EDIT: That being said: this is probably completely moot for my own purposes. I can't afford an MI250x system at all. At best I'd make some kind of hand-built consumer rig for my own personal purposes. So 6800 xt would be all I personally need. VRAM-constraints feel quite real, so the 16GBs of VRAM at that price makes 6800xt a very pragmatic system for personal use and study.

Interesting. So what is your workflow right now?
Develop against CUDA locally. Port my kernels to ROCm, and occupy a whole HPC node for debugging and performance tuning for a week. It’s terrible.

Edit: I should say that their recommendation is to write the kernels in HIP, which is supposed to be their cross-device wrapper for both CUDA and ROCm. I’m writing in Julia, however, so that’s not possible.

The AMD software stack has been behind for a long time but I feel like we're finally catching up. I heard that HIP (and hopefully the rest of ROCm) is now supported on the RX 6800 XT consumer GPU... maybe that could help? BTW my team at AMD has been using Julia for ML workloads for a while. We should get in touch - maybe some of the lessons we learn can be useful to you. My email is claforte. The domain I'm sure you can guess. ;-)
BTW have you tried `KernelAbstractions.jl`? With it you can write code once that will run reasonably fast on AMD or NVIDIA GPUs or even on CPU. One of our engineers just started using it and is pleased with it - apparently the performance is nearly equivalent to native CUDA.jl or AMDGPU.jl, and the code is simpler.
If you are using Julia I would recommend looking at AMDGPU.jl and (plugging my own project here) KernelAbstractions.jl
Can you write SYCL code and compile it to ROCm for production?
> Surely the people at these labs will want to run ordinary DL frameworks at some point

I don't know about that. A lot of these labs are doing physics simulations and are probably happy to stick with their dense-matrix multiply / BLAS routines.

Deep learning is a newer thing. These national labs can run them of course, but these national labs have existed for many decades and have plenty of work to do without deep learning.

> or do they have the money and time to always build entirely custom stacks?

Given all the talk about OpenMP compatibility and Fortran... my guess is that they're largely running legacy code in Fortran.

Perhaps some new researchers will come in and try to get some deep-learning cycles in the lab and try something new.

From my limited exposure to the HPC groups at the labs, there's a mixture of languages in use. It seems that modern C++ is the dominant language for a lot of new projects--some of the people I talked to were working on libraries that aggressively used C++11/C++14 features.

The biggest challenge the national labs face is that there's not really any budget (or appetite) to rewrite software to take advantage of hardware features (particularly the GPU-based accelerator that's all the rage nowadays). You might be able to get a code rewritten once, but an era where every major HPC hardware vendor wants you to rewrite your code into their custom language for their custom hardware results in code that will not take advantage of the power of that custom hardware. OpenMP, being already fairly widespread, ends up becoming the easiest avenue to take advantage of that hardware with minimal rewriting of code (tuning a pragma doesn't really count as rewriting).
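
To make "tuning a pragma" concrete, here is a minimal sketch, assuming a compiler with OpenMP 4.5 or later offload support. `daxpy` is just an illustrative loop, not code from any lab. The loop body is what a legacy code would already contain; the only addition is the directive. Built without OpenMP, the pragma is ignored and the loop runs serially on the host.

```c
#include <assert.h>
#include <stddef.h>

/* y = alpha * x + y over n elements.
   The map clauses say which arrays are copied to the device (x) and
   which are copied both ways (y). */
void daxpy(size_t n, double alpha, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```

The same source then builds for a CPU-only machine or any accelerator the compiler can target, which is the portability argument being made above.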

Also, while NVidia has been adding extra AI acceleration to their chips AMD has been throwing in extra double precision resources that HPC generally requires. If you're training an AI rather than simulating the climate/a thermonuclear explosion/etc then you're probably better off using NVidia cards but AMD made the right technical investments to get these supercomputer contracts.
It's kind of surprising that nvidia hasn't purchased AMD. It really feels like there's a single company between the two that would be truly effective: AMD for the classic CPU oomph, nvidia for the GPU oomph, combining their strengths in interconnects. It would be a player from the high-end PC to the supercomputer market, without even pretending to go for the low-power market (ARM).
> It's kind of surprising that nvidia hasn't purchased AMD.

One word: antitrust. The discrete GPU market these days consists of Nvidia and AMD, with Intel only just now dipping its toes into the market (I don't think there's anything saleable to retail customers yet). Nvidia buying AMD would make it a true monopoly in that market, and there's no way that would pass antitrust regulators. Nvidia recently tried to buy ARM, and even that transaction was enough for antitrust regulators to say no.

AMD and Nvidia were in talks to merge at one point; apparently the talks fell apart because Nvidia's CEO insisted on being the CEO of the combined company and AMD would have none of that. So AMD purchased ATI instead, probably overpaid for it, and probably pushed the Bulldozer concept too hard in an effort to prove it was worth it after all.

Nvidia actually used to develop chipsets for AMD processors, including onboard GPUs. They did for Intel as well, but they had a much more serious relationship with AMD in my estimation. This stopped with the ATI purchase: since ATI was Nvidia's main competitor, the two companies stopped working together. Intel later killed all 3rd-party chipsets altogether, and AMD had to do a lot of chipset work they weren't doing before.

I sometimes wonder what would have happened if they had merged back then. I personally think a Jensen Huang-run AMD would have done much better than AMD+ATI did in that era. I could easily see ATI having collapsed. What would the consoles use now? Would nvidia have been as aggressive as it has been without the strategic weakness of not controlling the platform its products run on?

Intel and AMD have a patent-licensing agreement where Intel licenses their x86 stuff to AMD, and AMD licenses their amd64 stuff to Intel. AFAIK, the moment AMD gets bought by another company, they can no longer use Intel's patents, and the moment that happens, Intel can no longer use AMD's patents. I'm not sure how much of x86/amd64 you can legally implement without infringing on any of these patents, but it might very well result in a really awkward situation.

Sure, the new owners could re-negotiate with Intel, and maybe nothing would change. But who knows? A combined AMD/nVidia might be a sufficient threat to Intel they might pull some desperate moves.

(In some timeline, this turns out to be the boost that makes RISC-V the new "standard" ISA, but I am not so optimistic it is the one we live in.)

I think based on recent history you can argue that NVIDIA is very aware of the potential anticompetitive actions that could result if they kill or even substantially pass AMD.

There really used to be a lot of intra-generational tweaking and refinement, like if you look back at Maxwell there were really at least 3 and I suspect 4 total steppings of the maxwell architecture (GM107, GM204/GM200, and GM206 - and I suspect GM200 was a separate "stepping" too due to how much higher it clocks than GM204 - which is the opposite of what you'd expect from a big chip). Kepler had at least 4 major versions (GK1xx, GK110B, GK2xx, GK210), Fermi had at least 2 (although that's where I'm no longer super familiar with the exact details).

Anyway point is there used to be a lot more intra-generational refinement, and I think that has largely stopped, it's just thrown over the wall and done. And I think the reason for that is that if NVIDIA really cranked full-steam ahead they'd be getting far enough ahead of AMD to potentially start raising antitrust concerns. We are now in the era of "metered performance release", just enough to stay ahead of AMD but not enough to actually raise problems and get attention from antitrust regulators.

Same thing for the choice of Samsung 8nm for Ampere and TSMC 12nm for Turing, while AMD was on TSMC 7nm for both of those. Sure, volume was a large part of that decision, but they're already matching AMD with a 1-node deficit (Samsung 8nm is a 10+, and the gap between 10 and TSMC 7 is huge to begin with) and they were matching with a 1.5 node deficit during the Turing generation (12FFN is a TSMC 16+ node - that is almost 2 full nodes to TSMC 7nm). They cannot just make arbitrarily fast processors that dump on AMD, or regulators will get mad, so in that case they might as well optimize for cost and volume instead. If they had done a TSMC 7nm against RDNA1 they probably would be starting to get in that danger zone - I'm sure they were watching it carefully during the Maxwell era too.

(The people who imagined some giant falling-out between NVIDIA and TSMC are pretty funny in hindsight. (A) NVIDIA still had parts at TSMC anyway, and (B) TSMC obviously couldn't have provided the same volume as Samsung did, certainly not at the same price, and volume ended up being a godsend during the pandemic shortages and mining. Yeah, shortages sucked, but they could still have been worse if NVIDIA was on TSMC and shipping half or 2/3rds of their current volume.)

Of course now we may see that dynamic flip with AMD moving to MCM products earlier, or maybe that won't be for another year or so yet: rumors are suggesting monolithic midrange chips will be AMD's first product. Or perhaps "monolithic", being technically MCM but with cache dies/IO dies rather than multiple compute dies. But with RDNA3 AMD is potentially poised to push NVIDIA a little bit, rather than just being the controlled opposition we've seen for the past few generations, hence NVIDIA reportedly moving to TSMC N5P and going quite large with a monolithic chip to compete.

> Given all the talk about OpenMP compatibility and Fortran... my guess is that they're largely running legacy code in Fortran.

The most used linear algebra library is written in Fortran. There's nothing "legacy" about it, it's just that nobody was able to replicate its speed in C.

I don't remember the exact specifics, but Fortran disallows some of the constructs that C/C++ struggle with aliasing on, so Fortran can often be (safely) optimized to much higher-performance code because of this limitation/knowledge.

Like, it's always seemed like there's a certain amount of fatalism around Undefined Behavior in C/C++, like this is somehow how it has to be to write fast code but... it's not. You can just declare things as actually forbidden rather than just letting the compiler identify a boo-boo and silently do whatever the hell it wants.

Of course it's not the right tool for every task, I don't think you'd write bit-twiddling microcontroller stuff in fortran, or systems programming. But for the HPC space, and other "scientific" code? Fortran is a good match and very popular despite having an ancient legacy even by C/C++ standards (both have, of course, been updated through time). Little less flexible/general, but that allows less-skilled programmers (scientists are not good programmers) to write fast code without arcane knowledge of the gotchas of C/C++ compiler magic.

> I don't remember the exact specifics, but Fortran disallows some of the constructs that C/C++ struggle with aliasing on, so Fortran can often be (safely) optimized to much higher-performance code because of this limitation/knowledge.

For a crude approximation, Fortran is somewhat equivalent to C code where all pointer function arguments are marked with the restrict keyword.

> Like, it's always seemed like there's a certain amount of fatalism around Undefined Behavior in C/C++, like this is somehow how it has to be to write fast code but... it's not. You can just declare things as actually forbidden rather than just letting the compiler identify a boo-boo and silently do whatever the hell it wants.

Well, it's kind of more dangerous than C in this respect. The aliasing restriction is a restriction on the Fortran programmer; the compiler or runtime is not required to diagnose it, meaning that the Fortran compiler is allowed to optimize assuming that two pointers don't alias.

That being said, in general I'd say Fortran has fewer footguns than C or C++, and is thus often a better choice for a domain expert that just wants to crunch numbers.
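
A small C sketch of the `restrict` comparison made above. The function names are made up for illustration. The second version carries roughly the no-overlap promise a Fortran compiler gets to assume for array arguments, and as noted, nothing in the language enforces it. The `assert` only catches the most obvious violation (an in-place call) and only in debug builds.

```c
#include <assert.h>
#include <stddef.h>

/* No restrict: out may overlap a or b, so the compiler has to assume a
   store to out[i] can change a later a[j] or b[j]. */
void add_may_alias(size_t n, double *out, const double *a, const double *b)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* restrict: the caller promises the three arrays do not overlap.
   Breaking the promise is undefined behavior, not a compile error. */
void add_no_alias(size_t n, double *restrict out,
                  const double *restrict a, const double *restrict b)
{
    assert(n == 0 || (out != a && out != b));  /* partial sanity check */
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Both functions give the same result for non-overlapping inputs; the difference is only in what the optimizer is allowed to assume, for example when vectorizing the loop.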

> The most used linear algebra library is written in Fortran.

My understanding is that most supercomputers have the vendor provide their implementation of BLAS (e.g., if it's Intel-based, you're getting MKL) that's specifically tuned for that hardware. And these implementations stand a decent chance of being written in assembly, not Fortran.

Usually C or Fortran superstructure, and assembly kernels.

The clearest form of this is in BLIS, which is a C framework you can drop your assembly kernel into, and then it makes a BLAS (along with some other stuff) for you. But the idea is also present in OpenBlas.

Lots of this is due to the legacy of gotoBlas (which was forked into OpenBlas, and partially inspired BLIS), written by the somewhat famous (in HPC circles at least) Kazushige Goto. He works at Intel now, so probably they are doing something similar.
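
A rough sketch of the "C superstructure plus drop-in kernel" idea, for illustration only. None of these names are the BLIS or OpenBLAS API; `MR`, `NR`, `ukernel_fn`, `ukernel_ref` and `gemm_tiled` are all hypothetical. Real frameworks also pack panels and block for cache, which is left out here. The reference kernel is plain C; the point is that a tuned build would pass a different function pointer (typically hand-written assembly) without touching the outer loops.

```c
#include <assert.h>
#include <stddef.h>

#define MR ((size_t)4)   /* tile rows; hypothetical value */
#define NR ((size_t)4)   /* tile columns; hypothetical value */

/* A micro-kernel updates one MR x NR tile of C: C += A(MR x k) * B(k x NR).
   All matrices are row-major with the given leading dimensions. */
typedef void (*ukernel_fn)(size_t k,
                           const double *a, size_t lda,
                           const double *b, size_t ldb,
                           double *c, size_t ldc);

/* Portable reference kernel; a tuned build would swap this out. */
void ukernel_ref(size_t k,
                 const double *a, size_t lda,
                 const double *b, size_t ldb,
                 double *c, size_t ldc)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                c[i * ldc + j] += a[i * lda + p] * b[p * ldb + j];
}

/* Framework loop: walks the tiles of C (m x n) and delegates the math.
   A is m x k, B is k x n. Requires m % MR == 0 and n % NR == 0 to keep
   the sketch short. */
void gemm_tiled(size_t m, size_t n, size_t k,
                const double *a, const double *b, double *c,
                ukernel_fn kernel)
{
    assert(m % MR == 0 && n % NR == 0);
    for (size_t i = 0; i < m; i += MR)
        for (size_t j = 0; j < n; j += NR)
            kernel(k, a + i * k, k, b + j, n, c + i * n + j, n);
}
```

Calling `gemm_tiled(m, n, k, a, b, c, ukernel_ref)` accumulates A*B into C; porting to a new CPU means supplying a new kernel function and nothing else.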

BLAS itself has been rewritten in Nvidia CUDA and AMD HIP, and is likely the workhorse in this case. (Remember that Frontier is mostly GPUs and the bulk of code should be GPU compatible)

Presumably that old Fortran code has survived many generations of ports: Connection Machine, DEC Alpha, Intel Itanium, SPARC and finally today's GPU heavy systems. The BLAS layer keeps getting rewritten but otherwise the bulk of the simulators still works.

I think you've made a slightly bigger claim than is necessary, which has led to a focus on BLAS, which misses the point.

The best BLAS libraries use C and Assembly. This is because BLAS is the de-facto standard interface for Linear Algebra code, and so it is worthwhile to optimize it to an extreme degree (given infinite programmer-hours, C can beat any language, because you can embed assembly in C).

But for those numerical codes which aren't incredibly hand-optimized, Fortran makes nice assumptions, it should be able to optimize the output of a moderately skilled programmer pretty well (hey we aren't all experts, right?).

If you are talking about netlib blas/lapack I am very confused by what you are saying because the fastest blas/lapack implementations are in c/c++.
Surprisingly, ROCm support has been getting a lot better over the very recent years. In my experience the pytorch support is essentially seamless between CUDA and ROCm. Also, I know some popular frameworks like DeepSpeed have announced support and benchmarks on it as well: https://cloudblogs.microsoft.com/opensource/2022/03/21/suppo...
Yes, DOE is very interested in DL. I don't work on this personally, but you can see an example e.g. here [1, 2]. You can see in the first link they're using Keras. I'm not up to date on all the details (again, don't work on this personally) but in general the project is commissioned to run on all of DOE's upcoming supercomputers, including Frontier.

[1]: https://github.com/ECP-CANDLE/Benchmarks

[2]: https://www.exascaleproject.org/research-project/candle/

These supercomputer contracts typically have a large amount dedicated to software support. I remember reading on AnandTech (?) that AMD was explicitly putting a bunch of engineers on ROCm for this project. It's one of the reasons companies like these contracts so much.
The rocm stack is one of the toolchains deployed on Frontier. With determination, llvm upstream and rocm libraries can be manually assembled into a working toolchain too. It's not so much trickle down improvements as the same code.