Hacker News
by pbalau 12 days ago
What is the difference between unified memory and shared memory?

Shared memory has existed since the first CPU with an embedded GPU came to market, when you could set in the BIOS how much memory went to each component.

I do have an opinion about how unified memory could be different, but I want a proper explanation.

5 comments

I'm not sure everyone uses the terms consistently, but the difference is that the old "shared" memory reserved a section to act as VRAM under the control of the GPU, ignored by the OS. The CPU ran the same kind of code as with a discrete card, pretending there was a "bus transfer" between host memory and graphics memory.

In unified memory, all the memory is host memory and data can go from program to GPU with zero copy movements. The addresses of buffers can be shared via appropriate MMU translation support, so that the application and graphics subsystem are communicating effectively through the basic RAM cache coherency protocols over the same buffers.

Edit to add: Aside from the zero copy transfer potential, it also means dynamic allocation strategies can shift the balance between host and graphics allocations on the fly. Individual image and message buffers can be allocated on the fly instead of setting a static split between the two worlds.
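
To make the copy vs. zero copy distinction concrete, here is a minimal CPU-only analogy in Python. It is not GPU code: the `bytearray` just stands in for a buffer in the single memory pool, and the names (`app_buffer`, `staged_copy`, `zero_copy_view`) are made up for illustration.

```python
# CPU-only analogy; a bytearray stands in for a buffer in one memory pool.
app_buffer = bytearray(b"frame-0000")

# Old "shared" model: data is copied into a separate region before use.
staged_copy = bytes(app_buffer)

# Unified model: the consumer gets a view of the same bytes, no copy made.
zero_copy_view = memoryview(app_buffer)

# The producer updates the buffer in place (same length, so no resize).
app_buffer[6:10] = b"0001"

copy_sees = staged_copy.decode()
view_sees = zero_copy_view.tobytes().decode()

print(copy_sees)  # frame-0000 (stale until copied again)
print(view_sees)  # frame-0001 (sees the update immediately)
```

The copy has to be redone every time the data changes; the view never does. Real hardware additionally needs the MMU translation and cache coherency mentioned above, which this sketch does not model.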

Reserved sounds like it would have been a better term now that I'm reading this many years later.
You got it in one! That's exactly what makes unified memory superior for current use cases, and different from the shared memory woes of old.
That's my understanding, or, maybe a better word would be "guess". The CPU telling the GPU: this is your memory now.
To some degree this is how it already feels to program basically anything with DMA today. You map a buffer for the hardware through an IOMMU, stop touching it while the hardware is supposed to use it, and then reclaim it afterwards. So the model from the OS side feels the same; the difference is that the hardware is not copying the memory into some local memory to operate on it.
Shared memory of the past meant reserving a part of the memory for the GPU, which could then not be used or accessed by the CPU. If the CPU wanted to access something, it had to copy it from the GPU's section of the memory to its own. Unified memory means both just fully share the same memory.
For these specifically, they appear basically transparently to the GPU. There's a lot of software/firmware stuff for this, but also a different hardware architecture - while the RAM is on the CPU die, the NVLink-C2C gives it extremely low latency and 600GB/s bandwidth between the GPU and CPU.
Marketing, mostly? But perhaps also more flexibility with how much memory the GPU can directly access without reserving it.
No. Let’s define terms; as others have pointed out, they’re not perfect.

Unified memory is what Apple is doing, what other phones do, and what many low-end built-in GPUs have done in PCs for ages. There is only one physical memory pool. Both the CPU and GPU can access it at full speed.

This means no copying between pools of memory. No speed penalty accessing the CPU memory from GPU or vice versa. If the GPU only needs 2 GB to draw the desktop it only uses 2 GB of the pool. Or it can use 45 GB if it needs it and the CPU doesn’t. But all memory has to be the same speed, and that ain’t cheap given how fast GPUs like things. I don’t know if expandable memory is possible, and since they use the same bus they compete for bandwidth. Seems theoretically easier to program for, to me.

The opposite is what’s been common in graphics cards since the 2D era. CPU and GPU have their own memory and can talk over PCI/AGP/PCI-E. This is what I think they mean by shared memory; if it’s not, what’s the point in touting unified?

In this model, if the GPU uses 2 GB of its 12 GB total, the other 10 isn’t available to the OS at full speed and I’m not aware of any operating systems that would use it for programs/cache by default. If the GPU needs 45 GB… too bad. You have to page things in and out of GPU memory over the much slower system bus. Starting a game means loading assets into main memory then transferring them to the GPU (newer tech can accelerate this). But the CPU can have slower memory than the GPU, saving money. Memory expansion on the CPU side is easy. And the CPU saturating its memory bus has no effect on the speed of the GPU memory bus because it’s physically separate. It’s a more complicated memory model, but it’s the one everyone is used to.
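
A toy capacity check for the two models, as a rough sketch. The 2 GB, 45 GB and 12 GB figures come from the text above; the 64 GB pool size and the 8 GB of CPU usage are assumptions picked only to make the example run, and both function names are hypothetical.

```python
def fits_unified(pool_gb, cpu_gb, gpu_gb):
    """One pool: any split works as long as the total fits."""
    return cpu_gb + gpu_gb <= pool_gb

def fits_discrete(system_gb, vram_gb, cpu_gb, gpu_gb):
    """Two pools: each side must fit in its own memory."""
    return cpu_gb <= system_gb and gpu_gb <= vram_gb

CPU_USE_GB = 8  # assumed CPU-side usage

for gpu_need_gb in (2, 45):
    unified_ok = fits_unified(64, CPU_USE_GB, gpu_need_gb)
    discrete_ok = fits_discrete(64, 12, CPU_USE_GB, gpu_need_gb)
    print(gpu_need_gb, unified_ok, discrete_ok)

# VRAM left idle when the GPU only needs 2 GB of its 12 GB.
idle_vram_gb = 12 - 2
```

With these numbers the 45 GB case fits the unified pool but not the 12 GB card, which is where paging over the bus comes in. The sketch ignores bandwidth and cost entirely, which is the other half of the tradeoff.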

Which is better is a matter of opinion and workload needs.

Yes, I know there is an actual difference vs. dedicated GPUs with their own VRAM. I say it's marketing because Apple popularized the unified memory term even though, as you said, it existed in iGPUs long before Apple Silicon and was called shared GPU memory.

> I don’t know if expandable memory is possible

It technically is. These new systems (mostly) get their high bandwidth by using more channels (wider bus) of normal RAM modules. A system that has LPCAMM2 sockets should allow using the same LPDDR5X memory but you'd need a socket per two channels. A typical PC only supports two channels so having four (two sockets) would double the bandwidth.
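
A back-of-the-envelope version of the channel arithmetic, assuming 64 bits per channel and an LPDDR5X data rate of 8533 MT/s. Both figures are assumptions for illustration, not specs of any particular system, and `peak_bandwidth_gbs` is a made-up helper that gives theoretical peak only.

```python
def peak_bandwidth_gbs(channels, bits_per_channel, mega_transfers_per_s):
    """Theoretical peak in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = channels * bits_per_channel / 8
    return bytes_per_transfer * mega_transfers_per_s / 1000

# Assumed figures: 64-bit channels at 8533 MT/s.
two_channels = peak_bandwidth_gbs(2, 64, 8533)   # typical PC, one socket
four_channels = peak_bandwidth_gbs(4, 64, 8533)  # two sockets

print(f"{two_channels:.1f} GB/s")   # 136.5 GB/s
print(f"{four_channels:.1f} GB/s")  # 273.1 GB/s
print(four_channels / two_channels) # 2.0
```

Same memory speed in both cases; only the bus width changes, so the peak scales linearly with the number of channels.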

Bandwidth by going wider, not faster. That makes sense.
System RAM has much lower bandwidth and less predictable access. Notably, the transfer from system to GPU is very slow. About 30x slower. LLMs aren’t designed to queue or parallelise operations to account for this. They just become much slower.