by peter_d_sherman 1032 days ago
>"As an aside, new hardware platforms such as NVIDIA Grace Hopper natively support the Unified Memory programming model through hardware-based memory coherence among all CPUs and GPUs. For such systems, HMM is not required, and in fact, HMM is automatically disabled there.

One way to think about this is to observe that HMM is effectively a software-based way of providing the same programming model as an NVIDIA Grace Hopper Superchip."

1) I am curious what the AMD equivalent of nVidia's HMM is, or will be...

2) I am curious if software will be able to be written with HMM (or some higher level abstraction API) such that HMM enabled software will also function on an AMD or other 3rd party GPU...

3 comments

HMM is a Linux thing, not an nVidia thing. https://www.kernel.org/doc/html/v5.0/vm/hmm.html

AMD has much the same variations as nvidia here, some details at https://github.com/amd/amd-lab-notes/blob/release/mi200-memo.... The single memory systems are called APUs. The internet thinks the MI300 (in El Capitan) is one of those. The games consoles and mobile chips are too.

I'm not sure what the limits are in terms of arbitrary heterogeneous execution if you want to push the boundaries, e.g. can you JIT amdgpu code into memory you got from mmap and have one of the GPU execution units branch to it? I don't see why not, but haven't tried it.

In principle I suppose a page should be able to migrate between nvidia and amdgpu hardware on a machine containing GPUs from both vendors, though that isn't likely to be a well-tested path.

HMM is, I believe, a Linux feature.

AMD added HMM support in ROCm 5.0 according to this: https://github.com/RadeonOpenCompute/ROCm/blob/develop/CHANG...

Note: that isn't the same thing as what the OP describes, at least according to those release notes, but it does fall under the "HMM" umbrella. You still need to specifically allocate your memory with hipMallocManaged before it can be transparently used between the CPU and GPU. Nvidia calls this "unified memory" (and has had it for 10 years now.)

It's confusing, because there are basically three levels of "Heterogeneous Memory Management" in this regard, in order of increasing features and improved programming model:

1. Nothing. You have to both allocate memory with the right allocator (no malloc, no mmap), and also memcpy to/from the host memory to the device, when you want to use it. You still need to "synchronize" with the compute kernel to ensure it completes, before you can see results from a compute kernel.

2. Unified virtual memory. You have to allocate memory with the right allocator (no malloc, no mmap), but after that, you don't need to copy to/from the device memory via special memcpy routines. Memory pages are migrated to and from the device as you demand them; you can address more memory than your actual GPU has, hence "virtual". You still need to synchronize with the compute kernel to ensure it completes. You can (in theory) LD_PRELOAD a different malloc(3) routine that uses the proper managed allocator (cudaMallocManaged or whatever), making all malloc(3)-based memory usable for the accelerator, but it doesn't fix systems/libraries/programs that use custom non-malloc(3) allocators or e.g. mmap.

3. True heterogeneous memory management. You can use ANY piece of allocated memory, from any memory allocator, and share it with the accelerator, and do not need to copy to/from the device memory. You can use mmap'd pages, custom memory allocators, arbitrary 3rd party libraries, it doesn't really matter. Hell, you can probably set the PROT_WRITE bit on your own executable .text sections and then have the GPU modify your .text from the accelerator. The GPU and CPU have a unified view without any handholding from userspace. You still need to synchronize with the compute kernel to ensure it completes.
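
For concreteness, here is a rough CUDA sketch of the three levels above. It is an illustration under assumptions, not something taken from either vendor's docs: the kernel `add_one`, the helper `all_equal`, and the `CHECK` macro are made-up names, and I'm assuming device 0. The third case is guarded by `cudaDevAttrPageableMemoryAccess`, because passing a plain malloc pointer to a kernel only works on systems with HMM or hardware coherence.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel for illustration: increments every element in place.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: true if every element equals `expected`.
static bool all_equal(const int *data, int n, int expected) {
    for (int i = 0; i < n; ++i)
        if (data[i] != expected) return false;
    return true;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Level 1: device allocator plus explicit copies in both directions.
    int *host = (int *)malloc(bytes);
    if (!host) return 1;
    for (int i = 0; i < n; ++i) host[i] = 0;
    int *dev = nullptr;
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));
    add_one<<<blocks, threads>>>(dev, n);
    // This copy is ordered after the kernel and blocks the host until done.
    CHECK(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost));
    printf("level 1 ok: %d\n", all_equal(host, n, 1));
    CHECK(cudaFree(dev));

    // Level 2: special allocator, but no explicit copies.
    int *managed = nullptr;
    CHECK(cudaMallocManaged(&managed, bytes));
    for (int i = 0; i < n; ++i) managed[i] = 0;
    add_one<<<blocks, threads>>>(managed, n);
    CHECK(cudaDeviceSynchronize());  // still have to wait for the kernel
    printf("level 2 ok: %d\n", all_equal(managed, n, 1));
    CHECK(cudaFree(managed));

    // Level 3: hand the kernel memory that came from plain malloc.
    int pageable = 0;
    CHECK(cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0));
    if (pageable) {
        add_one<<<blocks, threads>>>(host, n);  // host currently holds all 1s
        CHECK(cudaDeviceSynchronize());
        printf("level 3 ok: %d\n", all_equal(host, n, 2));
    } else {
        printf("level 3 unavailable: no pageable memory access on this system\n");
    }

    free(host);
    return 0;
}
```

Note that the synchronization step shows up at every level; only the allocation and copy requirements change.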

Nvidia implements all the features above, while HIP/AMD only implements the first two. Note that AMD has been involved in various HMM-adjacent work for many years now (HSAIL, various GCC HSA stuff), so it's not like they're coming out of nowhere here. But as far as actual features and "it works today" goes, they're now behind if you're looking at HIP vs CUDA.

I can see how you got here from the release notes, but the conclusions are a bit off. For hardware and kernels that support the full HMM setup with AMD, you get level 3 today as long as XNACK is turned on. Systems like Frontier have been using it for some time now.

Also, level 2 can be subdivided into systems that implement it by having two allocations, one host and one device, and triggering transfers when the GPU might access memory (2.1), and those that implement demand paging (2.2). The HMM support adds demand paging for type 2.2 as well as type 3 on supported hardware, where without it HIP had to use either 2.1 or remote PCIe access to provide “unified memory”. Those were dark days, but for current hardware on appropriate kernels appropriately configured, AMD implements memory just as unified as either NVIDIA’s HMM or ATS implementations.
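
If you want to see which of these cases a given machine falls into, the CUDA runtime exposes device attributes that map roughly onto them. The sketch below is the NVIDIA-side analogue only (I'm not showing the HIP equivalent), it assumes device 0, and the mapping from attributes to 2.1 / 2.2 / 3 in the printed messages is my own reading of the attribute descriptions rather than official terminology.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int dev = 0;  // assumption: query the first device
    int managed = 0, concurrent = 0, pageable = 0, hostTables = 0;

    // Each query returns an error code; on failure the value stays 0.
    cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, dev);
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);
    cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, dev);
    cudaDeviceGetAttribute(&hostTables,
                           cudaDevAttrPageableMemoryAccessUsesHostPageTables, dev);

    if (!managed) {
        printf("no managed memory: explicit allocation and copies only\n");
        return 0;
    }

    // Without concurrent managed access the GPU cannot fault pages in on
    // demand, which is roughly the 2.1 case described above.
    printf("managed memory: %s\n",
           concurrent ? "demand paging (roughly 2.2)"
                      : "no demand paging (roughly 2.1)");

    if (pageable) {
        // Host page tables in use suggests hardware coherence (ATS style);
        // otherwise the support comes from software, i.e. HMM.
        printf("plain malloc/mmap memory usable on the GPU (level 3) via %s\n",
               hostTables ? "host page tables (ATS-style hardware)"
                          : "software (HMM)");
    } else {
        printf("plain malloc/mmap memory is not usable on the GPU\n");
    }
    return 0;
}
```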

This is not true.

Level 3 is supported by AMD on new hardware, e.g., Frontier. See https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#...

Amazing, thanks for the correction(s)!
Oh, very nice!
AMD’s answer will be “nothing”, IMHO.

They’ve really left this area wide open for over a decade now when it’s been extremely clear this is where the market was going.

Their GPU and GPU compute story is a mess, because ROCm has the most confusing compatibility story possible. They’ve been late to compute accelerators as well.

I don’t think there’ll be any abstraction layers either. The community as a whole is more than happy to be single vendor. AMD has shown they can’t build compute stacks, not for technology reasons but purely because of long-term decisions. The community therefore won’t do it for them.

ROCm already supports HMM.

You're not helping anything by going off on some rant based on an assumption and falsehood - this sort of comment is exactly the sort of thing the phrase "FUD" is used to describe.

You’re right that my rant is incorrect on the premise that they don’t have HMM, but it’s because I missed ROCm adding it two years ago. So my bad, and unfortunately I can’t edit my post so I’ll leave the link here with my apologies. https://www.phoronix.com/news/Radeon-ROCm-4.3

The reason I missed it is that ROCm dropped support for my cards very unceremoniously. At which point I gave up.

I do think the rest of my point outside of the first sentence is valid though. ROCm isn’t reliable to target. Nowhere near CUDA.

That it’s so dependent on what card you have and what OS/kernel you use, and that it’s so aggressive about dropping support for older cards, makes the entire ecosystem a mess. CUDA by comparison is so much more ubiquitous.

That becomes a chicken-and-egg problem with popular libraries adding ROCm support, because it then ends up targeting such a sliver (and a shifting sliver at that) of the market.