| HN Mirror

Note: that isn't the same thing as what the OP describes, at least according to those release notes, but it does fall under the "HMM" umbrella. You still need to specifically allocate your memory with hipMallocManaged before it can be transparently used between the CPU and GPU. Nvidia calls this "unified memory" (and has had it for 10 years now.)

It's confusing, because there are basically three levels of "Heterogeneous Memory Management" in this regard, in order of increasing features and improved programming model:

1. Nothing. You have to both allocate memory with the right allocator (no malloc, no mmap), and also memcpy to/from the host memory to the device, when you want to use it. You still need to "synchronize" with the compute kernel to ensure it completes, before you can see results from a compute kernel.

2. Unified virtual memory. You have to allocate memory with the right allocator (no malloc, no mmap), but after that, you don't need to copy to/from the device memory via special memcpy routines. Memory pages are migrated to/from as you demand them; you can address more memory than your actual GPU has, hence "virtual". You still need to synchronize with the compute kernel to ensure it completes. You can (in theory) LD_PRELOAD a different malloc(2) routine that uses the proper cudaMalloc call or whatever, making all malloc(2) based memory usable for the accelerator, but it doesn't fix systems/libraries/programs that use custom non-malloc(2) allocators or e.g. mmap

3. True heterogeneous memory management. You can use ANY piece of allocated memory, from any memory allocator, and share it with the accelerator, and do not need to copy to/from the device memory. You can use mmap'd pages, custom memory allocators, arbitrary 3rd party libraries, it doesn't really matter. Hell, you can probably set the PROT_WRITE bit on your own executable .text sections and then have the GPU modify your .text from the accelerator. The GPU and CPU have a unified view without any handholding from userspace. You still need to synchronize with the compute kernel to ensure it completes.

Nvidia implements all the features above, while HIP/AMD only implements the first two. Note that AMD has long been involved in various HMM-adjacent work for many years now (HSAIL, various GCC HSA stuff), so it's not like they're coming out of nowhere here. But as far as actual features and "It works today" goes, they're now behind if you're looking at HIP vs CUDA.