I mean is it possible to make unified memory systems with good performance or is it not really feasible due to memory timing/trace length issues?
It’s possible if you’re willing to go with RAM much slower than what GPUs like but typical of what CPUs use. That’s what integrated-graphics laptops have done for a long time, right?
But can you get high-end CPU and GPU performance with unified memory and keep the memory user-upgradable in a reasonable way? That’s what I don’t know.
> I mean is it possible to make unified memory systems with good performance or is it not really feasible due to memory timing/trace length issues?
LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using; there's always been some speed penalty. I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus to match Apple's Max tier SoCs, so it's somewhat of an open question whether those solutions can offer upgradability at the high end of the market for unified memory systems.
> LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using; there's always been some speed penalty.
LPCAMM2 supports up to 9600MT/s, which appears to be the same speed Apple is using.
> I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus
Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs. It's basically down to most existing DDR5 system boards/sockets having been designed before anyone was trying to run LLMs on consumer hardware, e.g. AM5 has a 128-bit memory bus and you're not changing that without a new socket. But every memory generation gets a new socket anyway, and the existing Threadripper Pro socket has a 512-bit memory bus as well.
Moreover, making the bus wider is "easy" -- the main problem with it is that it adds cost. Apple's least expensive machines use the same 128-bit memory bus as most PCs and the ones with the 512-bit bus cost as much as Threadripper if not more.
> LPCAMM2 supports up to 9600MT/s, which appears to be the same speed Apple is using.
The difference here is in what the standard defines on paper vs what is actually shipping in products and readily available off the shelf. Who's selling a whole system with LPCAMM2 certified for 9600MT/s? Intel's current-gen Panther Lake top of the line laptop chips are rated for 9600MT/s when using soldered LPDDR5x but only 7467MT/s when using LPCAMM2, according to their current datasheet: https://www.intel.com/content/www/us/en/content-details/8721...
That puts the current Intel-with-LPCAMM2 supported memory speed 1.5 years and counting behind Apple's shipping memory speeds. Intel's own shipping memory speed moved past 7467MT/s a few months earlier than even Apple's.
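To put the two ratings in bandwidth terms, here is a minimal sketch of the usual peak-bandwidth arithmetic (bus width in bytes times transfer rate). The function name `peak_bandwidth_gbs` is made up for illustration, the 128-bit width is the single-module LPCAMM2 width, and the result is a theoretical peak in decimal GB/s, not a measured figure.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Theoretical peak bandwidth in decimal GB/s for a bus width and transfer rate."""
    bytes_per_transfer = bus_width_bits / 8  # a 128-bit bus moves 16 bytes per transfer
    return bytes_per_transfer * transfer_rate_mts * 1e6 / 1e9


# Both speeds are the ratings quoted in the comment above, on a 128-bit bus.
soldered = peak_bandwidth_gbs(128, 9600)  # soldered LPDDR5x rating
lpcamm2 = peak_bandwidth_gbs(128, 7467)   # LPCAMM2 rating

print(f"soldered: {soldered:.1f} GB/s")
print(f"LPCAMM2:  {lpcamm2:.1f} GB/s")
print(f"LPCAMM2 reaches {lpcamm2 / soldered:.0%} of the soldered peak")
```

On paper that is roughly 154 GB/s versus 119 GB/s for the same bus width, so the gap is a bit over a fifth of the soldered peak.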
> Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs.
> Moreover, making the bus wider is "easy"
Citations needed. Servers aren't anywhere close to 9600MT/s yet; Intel and AMD are at 6400MT/s. The trace length advantages offered by LPCAMM2 don't necessarily mean the traces for the sixth or eighth channel would be short enough for 9600MT/s (which again, is not yet available even in a 128-bit configuration in shipping hardware). Adding more channels to even an LPCAMM2 configuration means adding more trace length, because only two modules can actually be adjacent to the CPU socket. (Maybe you could get to 512-bit with modules on the front and back of the board while maintaining trace lengths short enough to reach meaningfully higher speeds than regular DDR5, but so far nobody is doing that or even talking about it.)
> That puts the current Intel-with-LPCAMM2 supported memory speed 1.5 years and counting behind Apple's shipping memory speeds.
It turns out Apple isn't getting 9600MT/s either. I assumed that soldering would be getting them at least what LPCAMM2 is rated for, but if you actually do the math, they're getting ~8500MT/s for their most expensive systems and ~7500MT/s for the others.
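The "math" here is presumably the reverse calculation: divide an advertised bandwidth by the bus width in bytes to get the implied transfer rate. A minimal sketch follows. The comment does not say which machines or bandwidth figures were used, so the inputs below (546 GB/s on a 512-bit bus, 120 GB/s on a 128-bit bus) are assumptions picked for illustration, and `implied_transfer_rate_mts` is a made-up name.

```python
def implied_transfer_rate_mts(bandwidth_gbs: float, bus_width_bits: int) -> float:
    """Transfer rate in MT/s implied by a peak bandwidth (decimal GB/s) and bus width."""
    bytes_per_transfer = bus_width_bits / 8
    return bandwidth_gbs * 1e9 / bytes_per_transfer / 1e6


# Assumed inputs, not taken from the thread: (advertised GB/s, bus width in bits).
examples = {
    "wide system (512-bit, assumed 546 GB/s)": (546.0, 512),
    "base system (128-bit, assumed 120 GB/s)": (120.0, 128),
}

for label, (bandwidth, width) in examples.items():
    rate = implied_transfer_rate_mts(bandwidth, width)
    print(f"{label}: ~{rate:.0f} MT/s")
```

With those inputs the results land near 8500 MT/s and 7500 MT/s. A different product generation would give different inputs and a different answer, so the conclusion depends entirely on which systems are being compared.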
> Servers aren't anywhere close to 9600MT/s yet; Intel and AMD are at 6400MT/s.
Servers use conservative timings. EXPO memory kits above 6400MT/s are available for Threadripper with 8 channels. And again, these are using traditional DIMMs with longer traces rather than CAMM, but they're still managing an extremely wide bus with close to the same performance.
> The trace length advantages offered by LPCAMM2 don't necessarily mean the traces for the sixth or eighth channel would be short enough for 9600MT/s
CAMM modules use a compression fitting to attach the chips to the system board using approximately the same amount of space as the solder pads would for soldered chips. If you get to the point of having so many channels that the chips are in the way of the other chips then the soldered ones have the same problem.
> (which again, is not yet available even in a 128-bit configuration in shipping hardware).
A single LPCAMM2 module is a 128-bit bus. Every system that uses it has at least that.
> Maybe you could get to 512-bit with modules on the front and back of the board while maintaining trace lengths short enough to reach meaningfully higher speeds than regular DDR5, but so far nobody is doing that or even talking about it.
Nobody is really using a bus that wide with soldered memory either though, outside of the couple of Macs that start at ~$3500 and are getting the same speed Framework does with LPCAMM2.
> Framework already sells LPCAMM2 at 8533MT/s with full validation:
From your link:
> Framework Laptop 13 Pro (Intel® Core™ Ultra Series 3) supports one slot of LPCAMM2 memory up to 96GB at the native 7467 MT/s speed. It is compatible with LPCAMM2 modules with memory speed rated above 7467 MT/s, but the speed will be capped at 7467 MT/s because of the platform limitation.
The modules in question can only theoretically operate at 8533MT/s. Framework has yet to sell a system where the modules actually operate at more than 7467MT/s.
> It turns out Apple isn't getting 9600MT/s either. I assumed that soldering would be getting them at least what LPCAMM2 is rated for, but if you actually do the math, they're getting ~8500MT/s for their most expensive systems and ~7500MT/s for the others.
You're either doing the math wrong, or just plain looking at the wrong systems. Try looking at the M5 generation.
> CAMM modules use a compression fitting to attach the chips to the system board using approximately the same amount of space as the solder pads would for soldered chips. If you get to the point of having so many channels that the chips are in the way of the other chips then the soldered ones have the same problem.
Yes, that's a problem, and Apple has solved it by moving the DRAM on-package. Datacenter GPUs have also solved it that way by putting the DRAM on a silicon interposer to allow even wider bus widths. Soldering standard DRAM packages on the motherboard is not the limit of how memory can be soldered down.
> A single LPCAMM2 module is a 128-bit bus. Every system that uses it has at least that.
Yes, 128 bits at lower speeds. Did you forget that the whole point I'm making here is that the speeds are not the same?
> Nobody is really using a bus that wide with soldered memory either though, outside of the couple of Macs that start at ~$3500 and are getting the same speed Framework does with LPCAMM2.
The Mac Studio with the M3 Ultra is actually running the DRAM at a lower frequency than what Framework and other Intel-based systems could, but more than making up for it in bus width, to provide far more total memory bandwidth than any plausible LPCAMM2-based system that could be built today.
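A rough sketch of why width can outweigh per-pin speed. The 1024-bit width and 6400MT/s rate for the wide soldered system are assumptions not stated in this thread, and the four-module LPCAMM2 configuration is purely hypothetical (nobody ships one). Only the 128-bit module width and the 7467MT/s cap come from the discussion above.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Theoretical peak bandwidth in decimal GB/s."""
    return (bus_width_bits / 8) * transfer_rate_mts * 1e6 / 1e9


# (label, bus width in bits, transfer rate in MT/s)
configs = [
    ("wide soldered system (assumed 1024-bit @ 6400)", 1024, 6400),
    ("single LPCAMM2 module (128-bit @ 7467)", 128, 7467),
    ("hypothetical 4x LPCAMM2 (512-bit @ 7467)", 512, 7467),
]

results = {label: peak_bandwidth_gbs(width, rate) for label, width, rate in configs}

for label, bandwidth in results.items():
    print(f"{label}: {bandwidth:.1f} GB/s")
```

Under these assumptions the wide system comes out ahead of even the hypothetical four-module setup, despite the lower per-pin rate.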
> LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using;
Does it need to be leading, though? Being median is just fine for what high-RAM systems are intended to be used for.