by wtallis 5 days ago
> I mean is it possible to make unified memory systems with good performance or is it not really feasible due to memory timing/trace length issues?

LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using; there's always been some speed penalty. I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus to match Apple's Max tier SoCs, so it's somewhat of an open question whether those solutions can offer upgradability at the high end of the market for unified memory systems.

> LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using; there's always been some speed penalty.

LPCAMM2 supports up to 9600MT/s, which appears to be the same speed Apple is using.

> I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus

Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs. It's basically down to most existing DDR5 system boards/sockets having been designed before anyone was trying to run LLMs on consumer hardware, e.g. AM5 has a 128-bit memory bus and you're not changing that without a new socket. But every memory generation gets a new socket anyway, and the existing Threadripper Pro socket has a 512-bit memory bus as well.

Moreover, making the bus wider is "easy" -- the main problem with it is that it adds cost. Apple's least expensive machines use the same 128-bit memory bus as most PCs and the ones with the 512-bit bus cost as much as Threadripper if not more.
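For anyone following along with the numbers, here's a minimal sketch of the usual peak-bandwidth arithmetic (bus width in bytes times transfer rate). The function name is made up for illustration, and the result is a theoretical ceiling, not what any real system sustains:

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Theoretical peak bandwidth in decimal GB/s.

    Hypothetical helper for illustration: bytes moved per transfer
    multiplied by millions of transfers per second.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts / 1000


# Same transfer rate, different bus widths: bandwidth scales linearly with width.
for width in (128, 256, 512):
    print(f"{width}-bit @ 9600MT/s: {peak_bandwidth_gbs(width, 9600):.1f} GB/s")
```

At a fixed transfer rate, a 512-bit bus has four times the peak bandwidth of a 128-bit one, which is why bus width and cost dominate this comparison.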

> LPCAMM2 supports up to 9600MT/s, which appears to be the same speed Apple is using.

The difference here is in what the standard defines on paper vs what is actually shipping in products and readily available off the shelf. Who's selling a whole system with LPCAMM2 certified for 9600MT/s? Intel's current-gen Panther Lake top of the line laptop chips are rated for 9600MT/s when using soldered LPDDR5x but only 7467MT/s when using LPCAMM2, according to their current datasheet: https://www.intel.com/content/www/us/en/content-details/8721...

That puts the current Intel-with-LPCAMM2 supported memory speed at 1.5 years and counting lag behind Apple's shipping memory speeds. Intel's own shipping memory speed moved past 7467MT/s a few months earlier than even Apple's.

> Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs.

> Moreover, making the bus wider is "easy"

Citations needed. Servers aren't anywhere close to 9600MT/s yet; Intel and AMD are at 6400MT/s. The trace length advantages offered by LPCAMM2 don't necessarily mean the traces for the sixth or eighth channel would be short enough for 9600MT/s (which again, is not yet available even in a 128-bit configuration in shipping hardware). Adding more channels to even an LPCAMM2 configuration means adding more trace length, because only two modules can actually be adjacent to the CPU socket. (Maybe you could get to 512-bit with modules on the front and back of the board while maintaining trace lengths short enough to reach meaningfully higher speeds than regular DDR5, but so far nobody is doing that or even talking about it.)

> Who's selling a whole system with LPCAMM2 certified for 9600MT/s?

The 9600MT/s modules are new and will probably be found at some point this year. Framework already sells LPCAMM2 at 8533MT/s with full validation:

https://knowledgebase.frame.work/what-drammemory-is-supporte...

> That puts the current Intel-with-LPCAMM2 supported memory speed at 1.5 years and counting lag behind Apple's shipping memory speeds.

It turns out Apple isn't getting 9600MT/s either. I assumed that soldering would be getting them at least what LPCAMM2 is rated for, but if you actually do the math, they're getting ~8500MT/s for their most expensive systems and ~7500MT/s for the others.

> Servers aren't anywhere close to 9600MT/s yet; Intel and AMD are at 6400MT/s.

Servers use conservative timings. EXPO memory kits above 6400MT/s are available for Threadripper with 8 channels. And again, these are using traditional DIMMs with longer traces rather than CAMM, but they're still managing an extremely wide bus with close to the same performance.

> The trace length advantages offered by LPCAMM2 don't necessarily mean the traces for the sixth or eighth channel would be short enough for 9600MT/s

CAMM modules use a compression fitting to attach the chips to the system board using approximately the same amount of space as the solder pads would for soldered chips. If you get to the point of having so many channels that the chips are in the way of the other chips then the soldered ones have the same problem.

> (which again, is not yet available even in a 128-bit configuration in shipping hardware).

A single LPCAMM2 module is a 128-bit bus. Every system that uses it has at least that.

> Maybe you could get to 512-bit with modules on the front and back of the board while maintaining trace lengths short enough to reach meaningfully higher speeds than regular DDR5, but so far nobody is doing that or even talking about it.

Nobody is really using a bus that wide with soldered memory either though, outside of the couple of Macs that start at ~$3500 and are getting the same speed Framework does with LPCAMM2.

> Framework already sells LPCAMM2 at 8533MT/s with full validation:

From your link:

> Framework Laptop 13 Pro (Intel® Core™ Ultra Series 3) supports one slot of LPCAMM2 memory up to 96GB at the native 7467 MT/s speed. It is compatible with LPCAMM2 modules with memory speed rated above 7467 MT/s, but the speed will be capped at 7467 MT/s because of the platform limitation.

The modules in question can only theoretically operate at 8533MT/s. Framework has yet to sell a system where the modules actually operate at more than 7467MT/s.

> It turns out Apple isn't getting 9600MT/s either. I assumed that soldering would be getting them at least what LPCAMM2 is rated for, but if you actually do the math, they're getting ~8500MT/s for their most expensive systems and ~7500MT/s for the others.

You're either doing the math wrong, or just plain looking at the wrong systems. Try looking at the M5 generation.

> CAMM modules use a compression fitting to attach the chips to the system board using approximately the same amount of space as the solder pads would for soldered chips. If you get to the point of having so many channels that the chips are in the way of the other chips then the soldered ones have the same problem.

Yes, that's a problem, and Apple has solved it by moving the DRAM on-package. Datacenter GPUs have also solved it that way by putting the DRAM on a silicon interposer to allow even wider bus widths. Soldering standard DRAM packages on the motherboard is not the limit of how memory can be soldered down.

> A single LPCAMM2 module is a 128-bit bus. Every system that uses it has at least that.

Yes, 128 bits at lower speeds. Did you forget that the whole point I'm making here is that the speeds are not the same?

> Nobody is really using a bus that wide with soldered memory either though, outside of the couple of Macs that start at ~$3500 and are getting the same speed Framework does with LPCAMM2.

The Mac Studio with the M3 Ultra is actually running the DRAM at a lower frequency than what Framework and other Intel-based systems could, but more than making up for it in bus width, to provide far more total memory bandwidth than any plausible LPCAMM2-based system that could be built today.

> You're either doing the math wrong, or just plain looking at the wrong systems. Try looking at the M5 generation.

The M5 generation isn't "1.5 years old" and even those aren't all that speed. The M5 Max with the 32-core GPU is ~7200MT/s, while the one with the 40-core GPU is over $4000.

> Yes, that's a problem, and Apple has solved it by moving the DRAM on-package.

There is no "package" here. Apple's processors are soldered to the logic board, as are Intel's in laptops. The DRAM Apple uses is standard LPDDR5 from the normal OEMs. Have a look at the LPCAMM2 module. It has four standard DRAM chips on the top and a connector on the bottom. DDR5 channels are really 32-bits, so the 128-bit module has four channels, four chips. The module is barely any larger than the chips themselves. It's not saving significant space by soldering them, it's just an alternative means of attaching them to the system board in the same place.

> Yes, 128 bits at lower speeds.

At the same speeds Apple was shipping a few months ago. Apple being the first to ship LPDDR5-9600 when it was that recent doesn't imply that it needs to be soldered, it implies that they're a huge company that can pay for early access to the new thing whether it's soldered or not. 9600MT/s LPCAMM2 modules have already been announced -- it's not a technical problem, it's an "Apple and OpenAI are buying out the fastest DRAM right now" problem.

> The Mac Studio with the M3 Ultra is actually running the DRAM at a lower frequency than what Framework and other Intel-based systems could, but more than making up for it in bus width, to provide far more total memory bandwidth than any plausible LPCAMM2-based system that could be built today.

By this logic the thing to beat it is the 8S Xeon servers from almost a decade ago with 48 channels of DDR4-2666. Or existing 2S servers with 24 channels of DDR5-6400.
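A rough sketch of the arithmetic behind those two server configurations, assuming 64 data bits per DIMM channel (ECC bits excluded) and summing across all sockets. The helper name is hypothetical, and these are theoretical peaks; what a single workload sees across sockets will be lower:

```python
CHANNEL_WIDTH_BITS = 64  # assumption: 64 data bits per DIMM channel, ECC excluded


def aggregate_bandwidth_gbs(channels: int, transfer_rate_mts: float) -> float:
    """Theoretical aggregate bandwidth in decimal GB/s across all channels."""
    bytes_per_transfer = CHANNEL_WIDTH_BITS / 8
    return channels * bytes_per_transfer * transfer_rate_mts / 1000


# The two configurations mentioned above.
configs = {
    "8S, 48 channels of DDR4-2666": (48, 2666),
    "2S, 24 channels of DDR5-6400": (24, 6400),
}
for name, (channels, rate) in configs.items():
    print(f"{name}: {aggregate_bandwidth_gbs(channels, rate):.1f} GB/s")
```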

> The M5 Max with the 32-core GPU is ~7200MT/s,

Ok, so the problem is you doing the math wrong. Note that the MacBook Pro configuration you're talking about has a DRAM capacity of 36GB, compared to 48+ GB for the ones with all the cores enabled and the full memory bandwidth. That 32-core config isn't running the DRAM slower, it's running with a narrower bus and fewer DRAM chips: https://theapplewiki.com/wiki/MacBook_Pro_(16-inch,_M5_Max)
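To show why the assumed bus width matters when working backwards from a bandwidth figure, here's a small sketch. The 460.8 GB/s input and the 512-bit and 384-bit widths are illustrative assumptions picked to show the effect, not specs quoted from a datasheet, and the function name is made up:

```python
def implied_transfer_rate_mts(bandwidth_gbs: float, bus_width_bits: int) -> float:
    """Transfer rate (MT/s) implied by a peak bandwidth and an assumed bus width."""
    bytes_per_transfer = bus_width_bits / 8
    return bandwidth_gbs * 1000 / bytes_per_transfer


bandwidth = 460.8  # GB/s, illustrative figure only

# The same bandwidth implies very different DRAM speeds
# depending on which bus width you assume.
for width in (512, 384):
    rate = implied_transfer_rate_mts(bandwidth, width)
    print(f"assuming {width}-bit bus: ~{round(rate)} MT/s")
```

Dividing by the wrong width makes the DRAM look slower than it is, so the bus width has to be pinned down before inferring a speed.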

> There is no "package" here. Apple's processors are soldered to the logic board, as are Intel's in laptops.

Denying the difference between putting the RAM on-package vs on the motherboard doesn't make that difference stop being real.

> Apple being the first to ship LPDDR5-9600 when it was that recent doesn't imply that it needs to be soldered

Apple wasn't even close to being the first to ship LPDDR5-9600. Android phones using DRAM at that speed started shipping at the end of 2023, and moved on to 10700MT/s starting in 2024. The situation here is not anywhere close to being one of Apple paying a premium to get faster DRAM chips that other laptop manufacturers can't afford. Rather, for most of the past several years, laptop manufacturers (especially on the x86 side) have been unable to buy DRAM chips with a rating slow enough to match what their processors are capable of running at. It's become quite common to see on a ThinkPad spec sheet that e.g. the DRAM parts are rated for 7467MT/s but will only operate at 6400MT/s due to processor limitations, then the next year see that the DRAM parts are rated for 8533MT/s but run at 7467MT/s, and so on. LPDDR speed increases have been driven primarily by flagship smartphones, and even the leftover slower-binned parts are faster than what most laptops can handle.

Multiplexed DDR (MRDIMM) can go faster.

But for throughput, servers with 12 channels have pretty high theoretical bandwidth even with slower memory.

> LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using;

Does it need to be leading, though? Being median is just fine for what high-RAM systems are intended to be used for.