Hacker News
by curt15 772 days ago
I was talking with a friend in HPC lately who said that AMD is actually quite competitive in the HPC space these days. For example, Frontier (https://docs.olcf.ornl.gov/systems/frontier_user_guide.html) is an all-AMD installation. Do scientists actually use ROCm in their code or does AMD have another programming framework for their Instinct chips?
4 comments

I currently have a project with ORNL OLCF (on Frontier). The short answer is yes. Happy to answer any questions I can.
ROCm or HIP? Does it start out with porting a lot from CUDA etc. or starting fresh on top of the AMD APIs?

How much of the project time is spent on that compute API stuff in comparison to "payload" work?

>ROCm or HIP?

I'm not sure that's the right question to ask. Afaik ROCm is the name of the entire tech stack and HIP is AMD's equivalent to CUDA C++ (they basically replicated the API and replaced every "cuda" prefix with "hip", so they have functions called "hipMalloc" and "hipMemcpy").
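To make the renaming concrete, here is a toy sketch in Python that rewrites a handful of CUDA runtime names into their HIP twins. The mapping table is a small hand-picked subset and `toy_hipify` is a hypothetical helper name; real ports would use AMD's hipify tools (hipify-perl or hipify-clang), which cover far more than a prefix swap.

```python
import re

# Hand-picked subset of CUDA runtime names and their HIP counterparts.
# This table is illustrative only; it is nowhere near complete.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Longest names first so cudaMemcpyHostToDevice is not cut short at cudaMemcpy.
_names = sorted(CUDA_TO_HIP, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(n) for n in _names) + r")(?!\w)"
)


def toy_hipify(source: str) -> str:
    """Replace known CUDA identifiers with their HIP equivalents."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


cuda_src = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaFree(d_x);
"""

hip_src = toy_hipify(cuda_src)
print(hip_src)
```

Identifiers that are not in the table are left untouched, which is also roughly where a real port starts needing manual work.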

The repository is located at https://github.com/ROCm/HIP.

My project uses ROCm (torch, more or less). Working with OLCF staff, I've never heard of HIP in use, but based on their training series it is supported[0].
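For anyone wondering what "ROCm via torch" looks like from the Python side, here is a minimal sketch that reports which backend a PyTorch install was built against. It relies on ROCm builds of PyTorch reusing the `torch.cuda` namespace and setting `torch.version.hip`; `describe_torch_backend` is a hypothetical helper name, and the function falls back gracefully if torch is not installed.

```python
def describe_torch_backend() -> str:
    """Return a one-line summary of the GPU backend PyTorch was built for."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"

    # ROCm builds set torch.version.hip; CUDA builds set torch.version.cuda.
    hip_version = getattr(torch.version, "hip", None)
    cuda_version = getattr(torch.version, "cuda", None)
    if hip_version:
        backend = f"ROCm/HIP {hip_version}"
    elif cuda_version:
        backend = f"CUDA {cuda_version}"
    else:
        backend = "CPU-only build"

    # On ROCm builds the torch.cuda API is what talks to the AMD GPU.
    available = torch.cuda.is_available()
    return f"{backend}; torch.cuda.is_available() = {available}"


summary = describe_torch_backend()
print(summary)
```

The practical upshot is that model code written against `torch.cuda` usually does not need to mention HIP at all, which may be why it rarely comes up by name.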

Of course my personal experience isn't exhaustive and it can be inferred from the ongoing training series that it is in use in some cases.

Speaking from personal experience, ROCm itself is... challenging (which I already knew from prior endeavors). We've taken to developing and staging workloads on more typical MI2xx hardware and then moving them over to Frontier.

We currently have 20k node hours on Frontier via a Director's Discretion Project[1]. It's a relatively simple application and at the end of the day you have access to significant compute so depending on workload the extra effort for ROCm, etc is still worth it.

[0] - https://www.olcf.ornl.gov/hip-training-series/

[1] - https://www.olcf.ornl.gov/for-users/documents-forms/olcf-dir...

National labs sign "cost-effective" deals. NVIDIA isn't cost-effective. Aurora (at Argonne) is all Intel GPU. Aurora is also a clusterfuck so that just tells you these decisions aren't made by the most competent people.
LANL might disagree given that they just unveiled a new supercomputer with NVIDIA chips [1]. NVIDIA CEO Jensen Huang was even at the unveiling.

[1] https://ladailypost.com/los-alamos-national-laboratory-unvei...

Both Frontier and Aurora bet on unproven future chips. Sometimes it pays off and sometimes it doesn't.
They are competent people, just not in the fields techies want.

When you're a national laboratory and your wallet is taxes from fellow Americans, it is very important that you find a balance between bang and buck. Lest you get your budget slashed or worse.

NVIDIA absolutely gives deals to national labs and universities. See Crossroads @ LANL, Isambard in the UK, Perlmutter @ LBL. While AMD is being deployed at LLNL and ORNL, NVIDIA isn't done with their HPC game. Maybe not at the leadership level, but we'll see how Oak Ridge and LANL decide their next round of procurements.
"Winning" a national lab definitely confers benefits far beyond just financial ones – these are, by definition, the biggest deployments in the world. Both the technical experience setting these up, and the reputational benefit associated with this, is worth a great, great deal. (I don't know how much money HPE Cray makes, for example, but I'm sure it's not the money it makes that's stopped HPE from quietly sunsetting the brand.)
AMD has pretty much always been competitive in HPC; in AI not so much, because of software.
An interesting alternative question: "how necessary is ROCm when working with an APU?".

CUDA's advantage seemed to me to come mostly from memory management and task scheduling being so poor on AMD cards. If AMD has engineered that problem out of the system, we might be able to get away with using 3rd party libraries instead of these vendor-promoted frameworks.

This is a great question. In the sense that ROCm is pure userspace it's never necessary - make the syscalls yourself and the driver in the Linux kernel will do the same things ROCm would have done.

In practice if you go down that road on discrete GPU systems, allocating "fine grain" memory so you can talk to the GPU is probably the most tedious part of the setup. I gave up around there. An APU should be indifferent to that though.

There will be some setup to associate your CPU process with the GPU. Permissions style, since Linux doesn't let processes stomp on each other. That might be rather minimal and should be spelled out in roct.

Launching a kernel involves finding the part of the address space the GPU is watching, writing 64 bytes to it and then "ringing a doorbell" which is probably writing to a different magic address. There's a lot of cruft in the API from earlier generations where these things involved a lot of work.
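As a rough illustration of those 64 bytes, here is a Python sketch that packs a kernel dispatch packet and drops it into a simulated ring buffer. The field order follows `hsa_kernel_dispatch_packet_t` as I recall it from hsa.h, so check it against your headers. `FakeQueue` and `build_dispatch_packet` are hypothetical names and no GPU is touched; the addresses in the usage lines are made up.

```python
import struct

# Layout as I recall it from hsa_kernel_dispatch_packet_t (little-endian):
# header, setup, workgroup x/y/z, reserved0            -> 6 x uint16
# grid x/y/z, private_segment_size, group_segment_size -> 5 x uint32
# kernel_object, kernarg_address, reserved2, signal    -> 4 x uint64
AQL_DISPATCH = struct.Struct("<HHHHHHIIIIIQQQQ")
assert AQL_DISPATCH.size == 64

HSA_PACKET_TYPE_KERNEL_DISPATCH = 2  # value as defined in hsa.h, to my knowledge


def build_dispatch_packet(kernel_object, kernarg_address, grid, workgroup,
                          completion_signal=0):
    """Pack one 64-byte dispatch packet (fence bits left at zero for brevity)."""
    header = HSA_PACKET_TYPE_KERNEL_DISPATCH  # packet type lives in the low 8 bits
    setup = 3                                 # number of grid dimensions
    return AQL_DISPATCH.pack(
        header, setup,
        workgroup[0], workgroup[1], workgroup[2], 0,
        grid[0], grid[1], grid[2],
        0, 0,                                 # private / group segment sizes
        kernel_object, kernarg_address, 0, completion_signal,
    )


class FakeQueue:
    """Stand-in for the GPU-visible ring buffer plus its doorbell."""

    def __init__(self, slots=4):
        self.slots = slots
        self.ring = bytearray(slots * AQL_DISPATCH.size)
        self.write_index = 0
        self.doorbell = -1  # last packet index the "GPU" was told about

    def submit(self, packet):
        slot = self.write_index % self.slots
        offset = slot * AQL_DISPATCH.size
        self.ring[offset:offset + AQL_DISPATCH.size] = packet
        self.doorbell = self.write_index  # "ring the doorbell"
        self.write_index += 1
        return slot


queue = FakeQueue()
packet = build_dispatch_packet(0x1000, 0x2000, grid=(1024, 1, 1),
                               workgroup=(256, 1, 1))
slot = queue.submit(packet)
```

On real hardware the ring buffer and doorbell are mapped memory handed out by the driver, and the ordering of the writes matters; the sketch only shows the shape of the data.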

Game plan for finding out goes something like:

  1. Compile some GPU code and put it in the host process
  2. Make the calls into hsa.h to run that kernel
  3. Delete everything unused from hsa to get an equivalent that only uses roct
  4. Delete everything unused from roct to get the raw syscalls

Roct is a small C library that implements the userspace side of the kernel driver. I'd be inclined to link it into your application rather than drop it entirely, but ymmv. Rocr / HSA is a larger C++ library that has a lot more moving parts and is more tempting to drop from the dependency graph.
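As a very first poke at step 2, here is a Python sketch that tries to load the HSA runtime through ctypes and call `hsa_init` / `hsa_shut_down`. It assumes the shared library is discoverable under the name `hsa-runtime64`, which may not hold if ROCm lives under a path the loader does not search. `try_hsa_init` is a hypothetical helper name, and it returns None when the library cannot be found.

```python
import ctypes
import ctypes.util


def try_hsa_init():
    """Return the hsa_init status code, or None if the runtime is not found."""
    # Assumption: the loader can locate the library by this short name.
    path = ctypes.util.find_library("hsa-runtime64")
    if path is None:
        return None
    try:
        hsa = ctypes.CDLL(path)
    except OSError:
        return None

    # Both functions take no arguments and return an hsa_status_t enum.
    hsa.hsa_init.restype = ctypes.c_int
    hsa.hsa_shut_down.restype = ctypes.c_int

    status = hsa.hsa_init()
    if status == 0:  # 0 is HSA_STATUS_SUCCESS
        hsa.hsa_shut_down()
    return status


status = try_hsa_init()
print("HSA runtime not found" if status is None else f"hsa_init returned {status}")
```

Everything past this point (finding agents, creating a queue, loading a code object) is where the real work is, and is easier to do in C against hsa.h directly.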

Going beyond that, you could build a simplified version of the kernel driver that drops all the other hardware. Might make things better, might not. And beyond that there's the firmware on the GPU which might be getting more accessible soon, but iiuc is written in assembly so might not be that much fun to hack on. And beyond that you're on the silicon, where changing it is making a different chip really.