Hacker News
by AnthonyMouse 18 days ago
> Lets systems optimize utilization based on need, rather than be confined to specific pools

The trouble with this is that the different types of memory have different characteristics. Latency for ordinary system memory is actually better than it is for GDDR, because GDDR is optimized for bandwidth. The RTX 5090 has 1.8TB/s of memory bandwidth with a 512-bit memory bus. DDR5-9600 on the same bus width would have better latency but only about a third of the bandwidth.
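To make the arithmetic concrete, here is a rough sketch of the peak-bandwidth calculation (bus width times per-pin transfer rate, divided by 8 bits per byte). The 28 GT/s per-pin figure for the RTX 5090's GDDR7 is my assumption, chosen because it reproduces the ~1.8TB/s quoted above; the function name is made up for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_gt_s):
    """Theoretical peak bandwidth in GB/s.

    Each pin moves one bit per transfer, so
    bits/s = bus width * transfer rate, and bytes/s = bits/s / 8.
    """
    return bus_width_bits * transfers_gt_s / 8


# Assumption: 28 GT/s per pin for the GDDR7 on a 512-bit bus.
gddr7_gb_s = peak_bandwidth_gb_s(512, 28.0)

# DDR5-9600 means 9600 MT/s, i.e. 9.6 GT/s per pin, on the same bus width.
ddr5_gb_s = peak_bandwidth_gb_s(512, 9.6)

ratio = ddr5_gb_s / gddr7_gb_s

print(f"GDDR7, 512-bit:     {gddr7_gb_s:.1f} GB/s")
print(f"DDR5-9600, 512-bit: {ddr5_gb_s:.1f} GB/s")
print(f"DDR5 / GDDR7:       {ratio:.2f}")
```

These are theoretical peaks; sustained bandwidth is lower for both memory types.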

CPU workloads are generally latency-bound and GPU workloads are generally bandwidth-bound, which is why they use two different types of memory.

> Reduce overall memory cost, by letting system builders purchase a single type of memory in bulk instead of having to figure out GDDR vs DDR memory placement (important for SFF/portable machines)

The trouble with this is cost. In principle you could get the same 1.8TB/s of memory bandwidth as the RTX 5090 has, with the better latency of DDR5, by using DDR5 with a 1536-bit bus. This is indeed what multi-socket servers do: two sockets with 768 bits of memory channels per socket. But now check how much those system boards cost.
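The 1536-bit figure can be derived by running the bandwidth formula backwards. A quick sketch, assuming 64-bit DDR5 channels (the accounting that gives 768 bits for a 12-channel socket); the helper name is hypothetical.

```python
import math


def required_bus_width_bits(target_gb_s, transfers_gt_s, channel_bits=64):
    """Smallest whole-channel bus width that reaches the target bandwidth."""
    # bits needed = target bytes/s * 8 / per-pin transfer rate
    raw_bits = target_gb_s * 8 / transfers_gt_s
    # Memory is added a whole channel at a time, so round up.
    channels = math.ceil(raw_bits / channel_bits)
    return channels * channel_bits


# Target: the ~1.8TB/s (1800 GB/s) of the RTX 5090, using DDR5-9600 (9.6 GT/s).
width_bits = required_bus_width_bits(1800, 9.6)
channels = width_bits // 64
bits_per_socket = width_bits // 2  # split across two sockets

print(f"bus width: {width_bits} bits ({channels} channels)")
print(f"per socket in a 2-socket board: {bits_per_socket} bits")
```

Every one of those channels needs its own traces and DIMM slots or soldered packages, which is where the board cost comes from.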

But the remaining alternatives are both worse. If you use GDDR for the unified memory, then it costs more than DDR and you're going to have significantly worse latency for the CPU. If you use DDR without a bus 3-4 times wider than the GPU's already-wide bus, then the GPU gets starved for bandwidth.

4 comments

Isn't GDDR also based on a much earlier DDR implementation than DDR5?

It also has way better throughput because it's physically surrounding the chip itself and wired in a way that maximises this.

The real problem is interconnect speed and latency. We have made tons of progress elsewhere, but AI is exposing that the interconnect in many systems is just not great. Even the future PCIe 6.0 is fairly bandwidth-constrained compared to 8 channels of DDR memory or the way we solder GDDR next to the chip.
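For a rough sense of scale, here is a back-of-the-envelope comparison of raw link rates. All three configurations are assumptions on my part: a x16 PCIe 6.0 link at its 64 GT/s per-lane signaling rate (one direction, ignoring protocol overhead), 8 channels of 64-bit DDR5-6400, and a 512-bit GDDR7 bus at 28 GT/s per pin.

```python
def raw_gb_s(width_bits_or_lanes, gt_per_s):
    """Raw peak rate in GB/s: width * transfers per second / 8 bits per byte."""
    return width_bits_or_lanes * gt_per_s / 8


# Assumption: PCIe 6.0 x16, 64 GT/s per lane, one direction, no overhead.
pcie6_x16 = raw_gb_s(16, 64.0)

# Assumption: 8 channels of 64-bit DDR5-6400 (6.4 GT/s per pin).
ddr5_8ch = raw_gb_s(8 * 64, 6.4)

# Assumption: 512-bit GDDR7 at 28 GT/s per pin, soldered next to the GPU.
gddr7_512 = raw_gb_s(512, 28.0)

for name, value in [
    ("PCIe 6.0 x16 (per direction)", pcie6_x16),
    ("8ch DDR5-6400", ddr5_8ch),
    ("512-bit GDDR7", gddr7_512),
]:
    print(f"{name:30s} {value:8.1f} GB/s")
```

Under these assumptions the slot link is several times narrower than the DDR channels and over an order of magnitude narrower than the GDDR bus.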

We moved on from AGP and older formats to PCIe, and I think it's time to do that again. Maybe it's even time to move on from "slot"-based implementations in general, for both RAM (system and graphics) and GPUs.

In summary, we need consumer machines and workstations to use pin-based modules like LPCAMM RAM. And the interconnect on the motherboard itself needs to be both wider (more bandwidth) and lower latency. This might require moving on from the motherboard being two-dimensional (a flat board) to something like an L shape to gain more physical board space.

How about having a large pool of unified memory and expanding the next layer (L3?) of cache to accommodate more of the CPU's low-latency RAM usage?
As a rule, increasing the size of cache increases its latency, and how much of it you can use is capped by the quality of your cache management algorithms and the latency of the level above it.

Since CPUs are highly optimized, both increasing the latency of the main memory and increasing the size of L3 will probably lead to larger L3 latency.
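The tradeoff can be sketched with the usual average-memory-access-time formula: hit latency plus miss rate times the latency of the level above. Every number below is made up purely for illustration, not measured from any real CPU.

```python
def amat_cycles(l3_hit_cycles, l3_miss_rate, memory_cycles):
    """Average access time seen at L3: hit cost plus the expected miss penalty."""
    return l3_hit_cycles + l3_miss_rate * memory_cycles


# Hypothetical baseline: modest L3 in front of low-latency DDR.
baseline = amat_cycles(l3_hit_cycles=40, l3_miss_rate=0.20, memory_cycles=300)

# Hypothetical bigger L3: fewer misses, but every hit is slower.
bigger_l3 = amat_cycles(l3_hit_cycles=55, l3_miss_rate=0.15, memory_cycles=300)

# Same bigger L3, now in front of higher-latency unified memory.
bigger_l3_slow_mem = amat_cycles(l3_hit_cycles=55, l3_miss_rate=0.15, memory_cycles=400)

print(f"baseline:                  {baseline:.1f} cycles")
print(f"bigger L3:                 {bigger_l3:.1f} cycles")
print(f"bigger L3 + slower memory: {bigger_l3_slow_mem:.1f} cycles")
```

With these invented numbers the larger cache only breaks even on its own and loses once the memory behind it gets slower; whether it wins in practice depends on how much the miss rate actually drops.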

We might even decide to put 32GB of high-latency cache on the system board and then 12GB of throughput-optimized main memory close to the GPU. ;)
Did you mean 128GB (instead of 12GB)?

And yes, an L4 cache can be one way out of that problem. Another way is making the L3 cache lines wider and working the hell out of improving your management algorithm.

It's not a theoretically impossible problem. It's also not something you can solve automatically with a bit more money or some simple decisions. It's possible this is the best architecture available, but it's not certain by any means.

I mean 12GB, an amount that is typical in such a system today, which you can buy at any computer store.
Yeah, but unfortunately I hear trying to get more than that is quite hard.
Oh, I entirely misunderstood your comment :)
I think that's basically what Cerebras is doing?
I get all of that already, but stand by my original points: for most consumer, non-data center workloads, the compromises aren’t likely to be noticeable to the end user. We’re not talking about edge cases like local-AI or AAA gaming enthusiasts who want to run software at bleeding-edge capabilities and who will dissect performance deltas between driver versions or overclock their kit for maximum performance, because we’re the edge cases in the marketplace.

Everything is ultimately a compromise of some sort, and modern Unified Memory feels like one of the better compromises out there given the current plateauing of hardware scaling, the growing costs associated with memory and NAND, and the shifting complexity from hardware (more instruction sets, more accelerators, more cores) to software (more abstraction layers, more machine learning).

These are all good points that I agree with but rather than seeing an intractable problem I predict we'll see the role that GDDR would otherwise fill in this scenario replaced by a small block of HBM on the APU die. I don't know if it will ultimately end up unified or not but either way I don't think memory segmentation is the core problem here. Simply not needing to send transfers across the narrow and slow PCIe bus would fix most of the practical problems (at least AFAIK but I'm not an expert).

Transitioning over to wild speculation here, I think that most likely this will be treated as part of an absurdly large L3 (à la 3D V-Cache) or as an additional L4. In either case I expect the latency and power tradeoffs introduced to be tolerated as "good enough" even for the highest-end consumer gear. (Actually I wonder if some sort of special-case cache would be feasible, with memory addresses flagged by the graphics driver and regular CPU-related stuff skipping over it entirely. But by then we've squarely entered the territory of vaguely unhinged rambling on my part.)

Alternatively, if the performance caveats are deemed important enough to justify the added complexity, it wouldn't surprise me to see the HBM treated as an independent memory pool analogous to that of a dGPU. That wouldn't change the current status quo with respect to the GPU APIs, but it would significantly ameliorate the memory bandwidth bottleneck for inference workloads, and from a software perspective it is a drop-in replacement. You'd still write the code targeting the dGPU with explicit swapping to RAM, but when run on an appropriate APU it would get a massive speedup for free instead of suddenly being starved for bandwidth while also performing unnecessary copy operations.