by stego-tech 10 days ago
The Unified Memory pool is what will continue to be the “game changer” in systems architecture, especially outside of data centers.

The reality is even cutting edge games and consumer workloads don’t actually make full use of the PCIe bandwidth of the GPU or the bandwidth of its GDDR memory. Even local AI use cases don’t substantially or meaningfully benefit from faster memory, at least for average consumers.

A unified memory pool does two things:

1) Lets systems optimize utilization based on need, rather than be confined to specific pools

2) Reduce overall memory cost, by letting system builders purchase a single type of memory in bulk instead of having to figure out GDDR vs DDR memory placement (important for SFF/portable machines)

So at a time when memory is expensive, unified pools make more sense. Even when memory becomes cheap and plentiful again, it’s just practical at this point to allocate a larger overall pool instead of managing discrete sets.

The one big drawback is security. A shared memory pool means side-channel attacks against memory from the GPU or CPU could potentially compromise the other as well, meaning memory-safe designs are going to be critical to security going forward (which is good for Rust adherents, I figure).

25 comments

> Lets systems optimize utilization based on need, rather than be confined to specific pools

The trouble with this is that the different types of memory have different characteristics. Latency for ordinary system memory is actually better than it is for GDDR, because GDDR is optimized for bandwidth. RTX 5090 has 1.8TB/s of memory bandwidth with a 512-bit memory bus. The same bus width for DDR5-9600 would have better latency but only a third of the bandwidth.

CPU workloads are generally bounded by latency and GPU workloads are generally bounded by bandwidth, which is why they use two different types.

> Reduce overall memory cost, by letting system builders purchase a single type of memory in bulk instead of having to figure out GDDR vs DDR memory placement (important for SFF/portable machines)

The trouble with this is cost. In principle you could get the same 1.8TB/s of memory bandwidth as the RTX 5090 has, with the better latency of DDR5, by using DDR5 with a 1536-bit bus. This is indeed what multi-socket servers do, two sockets with 768 bits of memory channels per socket, but now check how much those system boards cost.

But the remaining alternatives are both worse. If you use GDDR for the unified memory then GDDR costs more than DDR and you're going to have significantly worse latency for the CPU. If you use DDR without a 3-4 times wider bus than the already-wide GPU then the GPU gets starved for bandwidth.
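
The bus-width arithmetic in this comment can be sketched roughly as below. This is a minimal sketch, assuming theoretical peak bandwidth is just bus width in bytes times transfer rate. The function name is made up for illustration, and the 28 GT/s per-pin rate for the RTX 5090's GDDR7 is an assumption that is not stated in the thread.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Theoretical peak bandwidth in GB/s (hypothetical helper).

    bus_width_bits / 8 gives bytes moved per transfer;
    transfer_rate_mts is in mega-transfers per second.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts * 1e6 / 1e9


# Assumption: 28 GT/s per pin for the 5090's GDDR7 on its 512-bit bus.
gddr7_5090 = peak_bandwidth_gbs(512, 28000)

# DDR5-9600 on the same 512-bit bus, and on a 3x wider 1536-bit bus.
ddr5_512 = peak_bandwidth_gbs(512, 9600)
ddr5_1536 = peak_bandwidth_gbs(1536, 9600)

print(f"GDDR7 512-bit:  {gddr7_5090:.1f} GB/s")
print(f"DDR5 512-bit:   {ddr5_512:.1f} GB/s ({ddr5_512 / gddr7_5090:.0%} of the GDDR7 figure)")
print(f"DDR5 1536-bit:  {ddr5_1536:.1f} GB/s")
```

Under those assumptions, the same-width DDR5 bus lands at about a third of the GDDR figure, and the bus has to be roughly three times wider to catch up. These are peak numbers only and say nothing about latency.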

Isn't GDDR also based on a much earlier DDR implementation than DDR5?

It also has way better throughput because it's physically surrounding the chip itself and wired in a way that maximises this.

The real problem is interconnect speed and latency. We have made tons of progress elsewhere but AI is exposing that the interconnect in many systems is just not great. Even future PCIE 6.0 is fairly bandwidth constrained compared to 8 channels of DDR memory or the way we solder GDDR next to the chip.

We moved on from AGP and older formats to PCI-E and I think it's time to do that again. And maybe even "slot" based implementations in general for both RAM (system and graphics) and GPUs.

In summary, we need consumer machines and workstations to use pin-based stuff like LPCAMM RAM. And the interconnect on the motherboard itself needs to be both wider (more bandwidth) and lower latency. This might require moving on from the motherboard being two-dimensional only (a flat board) to something like an L shape to gain more physical board space.

How about having a large pool of unified memory and expanding the next layer (L3?) of cache to accommodate more of the CPU's low-latency RAM usage?
As a rule, increasing the size of cache increases its latency, and how much of it you can use is capped by the quality of your cache management algorithms and the latency of the level above it.

Since CPUs are highly optimized, both increasing the latency of the main memory and increasing the size of L3 will probably lead to larger L3 latency.

We might even decide to put 32GB of high-latency cache on the system board and then 12GB of throughput-optimized main memory close to the GPU. ;)
You meant a 128GB (instead of 12GB)?

And yes, an L4 cache can be one way out of that problem. Another way is making the L3 cache lines wider and working the hell out of improving your management algorithm.

It's not a theoretically impossible problem. It's also not something you can solve automatically with a bit more money or some simple decisions. It's possible this is the best architecture available, but it's not certain by any means.

I mean 12GB, an amount that is typical in such a system today, which you can buy at any computer store.
I think that's basically what Cerebras is doing?
I get all of that already, but stand by my original points: for most consumer, non-data center workloads, the compromises aren’t likely to be noticeable to the end user. We’re not talking about edge cases like local-AI or AAA gaming enthusiasts who want to run software at bleeding-edge capabilities and who will dissect performance deltas between driver versions or overclock their kit for maximum performance, because we’re the edge cases in the marketplace.

Everything is ultimately a compromise of some sort, and modern Unified Memory feels like one of the better compromises out there given the current plateauing of hardware scaling, the growing costs associated with memory and NAND, and the shifting complexity from hardware (more instruction sets, more accelerators, more cores) to software (more abstraction layers, more machine learning).

These are all good points that I agree with but rather than seeing an intractable problem I predict we'll see the role that GDDR would otherwise fill in this scenario replaced by a small block of HBM on the APU die. I don't know if it will ultimately end up unified or not but either way I don't think memory segmentation is the core problem here. Simply not needing to send transfers across the narrow and slow PCIe bus would fix most of the practical problems (at least AFAIK but I'm not an expert).

Transitioning over to wild speculation here, I think that most likely this will be treated as part of an absurdly large L3 (ala 3D V-Cache) or as an additional L4. In either case I expect the latency and power tradeoffs introduced to be tolerated as "good enough" even for the highest end consumer gear. (Actually I wonder if some sort of special case cache would be feasible, with memory addresses flagged by the graphics driver and regular CPU related stuff skipping over it entirely. But by then we've squarely entered the territory of vaguely unhinged rambling on my part.)

Alternatively if the performance caveats are deemed to be important enough to justify the added complexity it wouldn't surprise me to see the HBM treated as an independent memory pool analogous to that of a dGPU. That wouldn't change the current status quo with respect to the GPU APIs but it would significantly ameliorate the memory bandwidth bottleneck for inference workloads and from a software perspective is a drop in replacement. You'd still write the code targeting the dGPU with explicit swapping to RAM but when run on an appropriate APU it would get a massive speedup for free instead of suddenly being starved for bandwidth while also performing unnecessary copy operations.

> The reality is even cutting edge games and consumer workloads don’t actually make full use of the PCIe bandwidth of the GPU or the bandwidth of its GDDR memory

Game dev here. For anyone reading this - it’s not because we’re lazy, it’s because _it’s really hard to do_.

One of the biggest differences between the current generation consoles and the current gen PCs is unified memory.

I live with a game dev myself, so I get it. Hell, it's hard even for PC developers who want to do things without leaning on abstraction layers or existing engines. Managing multiple discrete memory pools, asset swaps or calls between them, getting the respective subsystems to exchange data at just the right time so as not to impact other code and drag down performance - it's fucking hard in general.

A unified pool of memory suddenly makes that simultaneously easier, but also far more flexible, which frees up developer time and bandwidth to focus on other, more important tasks.

How much of that difficulty comes from the chosen game engine? I assume the engine is the primary factor in how resources are allocated.
Both lots and none at the same time. The engines definitely make decisions for you but with Unreal (for example) you can modify the RDG any way you see fit.

The problem is that when you need something on the GPU you have to go through RAM first (unless you have DMA, which is a more recent addition). That doesn’t just add latency, it also adds an extra step of cache invalidation, so you have to plan for that from the highest level of gameplay. If you need to prepare for a GPU memory miss _and_ a CPU memory miss as a worst case all the time, it’s very hard to make good use of the bandwidth in the best case.

One related question that you need to follow that with is the associated cost of switching the whole studio to another engine that's technically better or, if proposing that each studio tailor-make its own engine, the cost of that engineering, presuming they have or can learn the expertise to surpass whatever they're using currently.

I'm not a game developer, but there would also seem to be a link between resource usage by the engine and whatever content the production side is making. For all the commentary about how brilliant the id software engines are, if you examine the levels you pass through they're also very efficient with what they demand out of the engine - it's like an orchestra playing well together, not one instrument that means you can do anything.

I think much of the difficulty is just that, for example, the 1.8 TB/s of an RTX 5090 is a lot of bandwidth for a game to use. That's over 50,000 4k textures per second at 32bpp.
I agree with you in theory. A couple of points - that’s currently the most expensive and highest performing card on the market. Most people on Steam are using an RTX 3060 which has more like 360GB/s. That’s a factor of 6. How do you design resource usage that scales across that extreme a range? (We try to, fwiw).

That spec is also a throughput measured per second whereas our frame rates are much higher than 1/s. At 60hz, that’s now between 140 and 800 textures a frame. If you miss _one_ you don’t get that back.
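
For anyone who wants to redo the textures-per-second and textures-per-frame arithmetic, here is a rough sketch. It assumes a "4k texture" means 3840x2160 texels at 4 bytes each (32bpp), uncompressed, with no mipmaps. The function names and the size parameters are made up for illustration. Square 4096x4096 textures would roughly halve every figure.

```python
def texture_bytes(width: int = 3840, height: int = 2160, bytes_per_texel: int = 4) -> int:
    """Size of one uncompressed texture in bytes (assumed layout, no mipmaps)."""
    return width * height * bytes_per_texel


def textures_per_second(bandwidth_gbs: float, tex_bytes: int) -> float:
    """How many whole textures the memory bus could stream per second at peak."""
    return bandwidth_gbs * 1e9 / tex_bytes


def textures_per_frame(bandwidth_gbs: float, tex_bytes: int, fps: float) -> float:
    """The same budget divided across the frames rendered in one second."""
    return textures_per_second(bandwidth_gbs, tex_bytes) / fps


size = texture_bytes()
for name, bandwidth in (("RTX 5090", 1800.0), ("RTX 3060", 360.0)):
    per_second = textures_per_second(bandwidth, size)
    per_frame = textures_per_frame(bandwidth, size, fps=60)
    print(f"{name}: {per_second:,.0f} textures/s, {per_frame:,.0f} per frame at 60 fps")
```

With these assumptions the per-frame budget comes out at roughly 180 (3060) to 900 (5090) textures at 60 fps, the same ballpark as the range quoted above. It shrinks further at higher frame rates or with larger textures.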

A single main character in a game can be 2-5 regular textures, plus all of the extra mapping textures we have these days. Now do landscapes, environments, props, background videos, and it all adds up. 4k textures are pretty universally used. If you look at a tiny object up close we need a higher res texture to be able to show it neatly.

You also have memory pressure - raytracing makes heavy use of VRAM so you have to make the tradeoff of how much do you want to allocate to caching lighting, vs how much you want to keep textures and geo around.

Lastly, as you say, actually keeping up with 360GB/s from the CPU side is tough. If you require any transformation or CPU operations that’s just not going to happen. If you need to pull from disk, even on an NVMe drive reading synchronously, the max throughput is < 10% of that, and that assumes you are actually reading 360GB from disk. If you pause to do anything else, you’ll significantly slow that down. Players also generally don’t like it if we thrash their NVMe disks :)

All good points.

Absolutely an RTX 3060 is a more normal gamer GPU than the 5090, but you're also not playing in 4k without DLSS on a 3060. Drop to the most common resolution on Steam (1080p), and turn on DLSS and you've basically cancelled out that 6x factor in bandwidth. Even if the 3060 had more bandwidth, it doesn't have enough processing power for native 4k gaming in typical games. So 360 GB/s is still a lot of bandwidth for the resolution most 3060 gamers are using.

Playing at 1080p doesn’t reduce your texture size, for the most part. You still use those 4k textures because you’re only seeing a subset of the texture projected at a close distance. We’re still using 4k textures for terrain brushes to cover the 6km open worlds.

DLSS isn’t just a magic on switch for free perfect upscaling. If you rendered at 720p and DLSS’ed up to 1080 it’s still going to look pretty rubbish. It’s always surprising to me just how many people have 1080 monitors though, given we’ve had more than that for two generations of consoles.

And lastly - all the same points still apply about frame rate (which can be more than 60) and memory bandwidth per frame and cache invalidation etc at 360GB/s, as they do at 1.8TB/s.

That sounds like a lot, but: modern renderers do between 20 and 40 passes, many of them in screen space. And each screen space pass typically reads from at least two input images, sometimes 3 or 4 even with optimally packed inputs. At 60fps that can quickly get up to way over 2000 full screen buffer reads per second and more for less than optimal access patterns in some algorithms. That also doesn't account for texture access during shading passes, which are somewhat random memory accesses.
Very true, but I'll point out that even those 2000 full screen reads per second at 4k are only 4% of the 5090's bandwidth. Sacrificing some of that speed for a unified memory architecture seems like a good trade.

Plus, DLSS can greatly reduce the bandwidth requirements for 4K gaming.
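
The back-of-envelope numbers in this exchange can be reproduced with a small sketch. It assumes a full screen buffer is 3840x2160 at 4 bytes per pixel (HDR or multi-channel buffers would be larger) and takes the 5090's bandwidth as 1800 GB/s. The function names are made up for illustration.

```python
def full_screen_reads_per_second(passes: int, inputs_per_pass: int, fps: int) -> int:
    """Full screen buffer reads per second for a renderer with the given pass count."""
    return passes * inputs_per_pass * fps


def bandwidth_fraction(reads_per_second: float, buffer_bytes: int, bandwidth_gbs: float) -> float:
    """Share of peak memory bandwidth consumed by those reads."""
    return reads_per_second * buffer_bytes / (bandwidth_gbs * 1e9)


# Assumption: 4K buffer at 32 bits per pixel.
BUFFER_BYTES_4K = 3840 * 2160 * 4

low = full_screen_reads_per_second(passes=20, inputs_per_pass=2, fps=60)
high = full_screen_reads_per_second(passes=40, inputs_per_pass=4, fps=60)
print(f"reads per second: {low} to {high}")

share = bandwidth_fraction(2000, BUFFER_BYTES_4K, 1800.0)
print(f"2000 reads/s at 4K: {share:.1%} of 1800 GB/s")
```

Both sides' figures are consistent with this: 20 passes with 2 inputs already exceeds 2000 reads per second at 60 fps, and 2000 such reads are a low single-digit percentage of the 5090's peak bandwidth. Texture, geometry, and BVH traffic and all writes come on top.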

I'm being very, very conservative with my estimates here. Based on the renderers I know, I could have easily tweaked the numbers to go up to 8000 full screen texture reads per second. That doesn't include texture or geometry or BVH reads or any memory writes. That is all in addition to those operations.
What? It's incredibly easy to make full use of memory bandwidth. For example, put proper volumetric smoke/fire/explosion sim in your game. But game developers don't do that because they are lazy.
No, we don’t do it because the tradeoff isn’t worth it. A gpu based particle sim is very difficult to do well - it’s easy (but computationally expensive) to do a volumetric sim, but when you want that simulation to interact with world geometry correctly it comes with an explosion in complexity and performance.

I promise you we want our games to look as good as you want them to look.

How does interaction with world geometry come with an explosion in complexity and performance? Advection has almost the same cost regardless of whether some cells are solid or not. It's one extra line in your shader + 1 bit per cell. JFA to build solid mask.
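
To make the "one extra line + 1 bit per cell" claim concrete, here is a toy sketch of semi-Lagrangian advection on a 1D grid with a solid mask. It is written in plain Python rather than as a shader, uses a single constant velocity, and the function and parameter names are made up for illustration. It covers only the advection step, not pressure projection, moving geometry, or building the mask (the JFA part), which is where the disagreement in this thread lies.

```python
import math


def advect_1d(density, velocity, solid, dt):
    """One semi-Lagrangian advection step on a 1D grid (illustrative only).

    density:  list of floats, one per cell
    velocity: constant velocity in cells per unit time (simplifying assumption)
    solid:    list of bools, True where the cell is world geometry
    """
    n = len(density)
    out = [0.0] * n
    for i in range(n):
        if solid[i]:
            # The "one extra line": solid cells never hold smoke.
            continue
        # Trace backwards to where this cell's contents came from.
        x = min(max(i - velocity * dt, 0.0), n - 1.0)
        i0 = int(math.floor(x))
        i1 = min(i0 + 1, n - 1)
        t = x - i0
        # Samples that land inside geometry contribute nothing.
        d0 = 0.0 if solid[i0] else density[i0]
        d1 = 0.0 if solid[i1] else density[i1]
        out[i] = (1.0 - t) * d0 + t * d1
    return out


puff = [0.0, 1.0, 0.0, 0.0]
print(advect_1d(puff, velocity=1.0, solid=[False] * 4, dt=1.0))
print(advect_1d(puff, velocity=1.0, solid=[False, False, True, False], dt=1.0))
```

The mask check adds a branch and a couple of lookups per cell, so the per-cell cost of this step barely changes. Whether the rest of a production sim stays that cheap is a separate question.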
And conveniently, by making your machine non-upgradeable, it allows the manufacturer to enforce market segmentation / charge a huge premium for a small RAM upgrade (a la Apple)
It doesn't -have- to be that way necessarily...

LPCAMM2/SOCAMM2 exist, heck I think Framework is using LPCAMM2 in one of their new laptops.

Heck, I'm willing to bet that a lot of manufacturers would rather go that route than soldered in, if for no other reason than the relative cost of warranty work between the two.

However, people probably need to stop being obsessed with ultrathin laptops for that to happen.

> However, people probably need to stop being obsessed with ultrathin laptops for that to happen.

I've never been able to understand this. Once we made it down to ~20 mm (which for the record still accommodates dual-stacked SO-DIMMs, a 2.5 inch bay, and a user replaceable battery but not an RJ45 jack) I don't understand what the practical impact of any further reduction is supposed to be. Regardless of how thin you make it the thing will still be a massive rectangle that you can't flex or press on.

> Regardless of how thin you make it the thing will still be a massive rectangle that you can't flex or press on.

There's very wide variation between laptops in how noticeably they'll flex or yield or creak when pressed. Laptops with a build quality that actually feels solid are far from being ubiquitous or even a majority.

Doubling the thickness of my MacBook Air would probably make it regress on that solid feeling, unless the weight was also significantly increased.

And regardless of whether current laptop form factors could accommodate a 2.5" drive, there's no use in doing so. That drive form factor is entirely obsolete for laptops and is just a waste of space and materials, and has been for about a decade.

I wasn't saying that I want a 2.5 inch drive, I was merely listing off a number of rather large things that fit just fine within a 20 mm budget.

I'm not sure why you seem to think that making something thicker would reduce the stiffness or strength. It's generally the opposite - see the concept of a torsion box. Anyway that wasn't the point. The point was that regardless of how thin you make the thing it will forever remain a cumbersome and delicate item that you have to treat with care when packing so what meaningful positive impact does shaving off those last few mm have? It's never made any sense to me.

They aren't, that was a push from manufacturers and PR. Find me one person that asked for a thinner phone after the iPhone 4
Sir! I am typing this on a Lenovo Carbon X1, with soldered on ram, and you are EXACTLY CORRECT!

I would much prefer two SODIMM sockets with the option to go to 32MB shared video memory, or DDR4/DDR5. Give me OPTIONS!

I came here to say just this myself! Modern DIMM formats make SFF/portable builds with unified memory pools far more plausible than prior designs. There's absolutely no reason desktop machines couldn't implement similar DIMM formats or design a new board standard around something similar.

Unified memory doesn't have to be soldered on or unserviceable. That's a choice Apple made because it fit their product vision, but it's not mandatory in the slightest.

Yup - we need pin based memory. Period. It's a physics thing.

CPUs don't slot in for a reason

There is LPCAMM2, if manufacturers want to use it.

So, it does not have to be soldered.

LPCAMM2 is available in real systems at 7467MT/s and 120ns latency, vs Apple (and Intel) at 9600MT/s (and Apple soldered memory at 100ns latency).

I don't know how linear or sensitive CPU and GPU benchmarks are to such a 20% slowdown, but I don't think Apple wants to pay it. And it looks like the next generation will be even closer to the SOC.
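
The "20% slowdown" can be checked against the figures quoted in this comment with a tiny sketch. The helper name is made up for illustration, and the numbers are taken from the comment as given, not independently verified.

```python
def relative_change(new: float, base: float) -> float:
    """Fractional change of `new` relative to `base` (negative means lower)."""
    return (new - base) / base


# Figures as quoted above: LPCAMM2 vs soldered memory.
rate_change = relative_change(7467, 9600)     # transfer rate in MT/s
latency_change = relative_change(120, 100)    # latency in ns

print(f"transfer rate: {rate_change:+.1%}")
print(f"latency:       {latency_change:+.1%}")
```

So the quoted LPCAMM2 parts are a bit over 20% lower on transfer rate and 20% higher on latency. How that translates into benchmark results is exactly the open question raised above.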

LPCAMM2 is also brand new. It likely will improve a lot.

We're also hitting the limit of DDR5 here (before moving to multiplexed)

I would guess if you had LPCAMM2 located physically around the CPU (one or two on each of the 4 CPU edges) you could also reduce that latency.

It's still further away than the RAM on a packaged CPU and latency is limited by the speed of light/electrons on that scale.
How about the LPCAMM route? Framework uses LPCAMM2 in 13 Pro laptop mainboards and claims that it satisfies the iGPU and NPU hardware without needing soldered RAM.
Until LPCAMM2 came along, using low power LPDDR RAM meant soldering RAM to the motherboard.

If you wanted to get sleep right and improve battery life, that was the trade off.

> to get sleep right

Thought getting sleep right was something that happened before MS decided they need to be able to wake your PC any time they want and not hardware related much.

Macs were known for far longer standby times while sleeping long before MS completely screwed the pooch with their "modern" standby.
Is that required or just a choice Apple made?
What do you mean by required? Apple's prices are notoriously disconnected from the cost of manufacturing.
I mean is it possible to make unified memory systems with good performance or is it not really feasible due to memory timing/trace length issues?

It’s possible if you’re willing to go with the kind of RAM CPUs often use, which is much slower than GPUs like. That’s what integrated graphics laptops have done for a long time, right?

But can you get high end CPU and GPU performance with unified memory and maintain user upgradable memory in a reasonable way? That’s what I don’t know.

> I mean is it possible to make unified memory systems with good performance or is it not really feasible due to memory timing/trace length issues?

LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using; there's always been some speed penalty. I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus to match Apple's Max tier SoCs, so it's somewhat of an open question whether those solutions can offer upgradability at the high end of the market for unified memory systems.

> LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using; there's always been some speed penalty.

LPCAMM2 supports up to 9600MT/s, which appears to be the same speed Apple is using.

> I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus

Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs. It's basically down to most existing DDR5 system boards/sockets having been designed before anyone was trying to run LLMs on consumer hardware, e.g. AM5 has a 128-bit memory bus and you're not changing that without a new socket. But every memory generation gets a new socket anyway, and the existing Threadripper Pro socket has a 512-bit memory bus as well.

Moreover, making the bus wider is "easy" -- the main problem with it is that it adds cost. Apple's least expensive machines use the same 128-bit memory bus as most PCs and the ones with the 512-bit bus cost as much as Threadripper if not more.

> LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using;

Does it need to be leading, though? Being median is just fine for what high-RAM systems are intended to be used for.

You mean Apple prices are notoriously over priced, over hyped, under powered, and

"Abdul Jabar, couldn't have made these prices, with a sky hook."

Both. Soldered RAM is faster. Also, Apple doesn't want to offer upgradability after purchase.
Don't I/you wish. The mechanical junction adds no delay, only manufacturing expense, and the delay of purchasing new systems to keep up with OS bloat.

Actually the opposite is true. Socketed RAM can be overclocked and have its timings adjusted, while soldered RAM can't. Two Lenovos: one soldered (Carbon X1), and one T590 with one slot: Crucial 16GB, 260-pin SODIMM, DDR4 PC4-19200. Exact same processor. The X1 has soldered DDR3, 532.0 MHz PC3-1066. The T590 has DDR4, PC4-19200, 1200 MHz.

Both have a Core i7 8665U... and the T590 is much faster, with socketed ram.

I think you'll find that in the current day, high speed LP(?)DDR5 requires a better signal path than what the SODIMM can provide. Which is why laptop makers initially moved to soldered RAM before moving to CAMM (probably only for the high end ones).
A note that this was rather common in the days before PC clones took off.

The vertical integration many associate with Apple, was the common approach to most 8 and 16 bit home computers.

Naturally after all these years, many PC vendors want their margins back, and thus the phenomenon of everyone going back to vertical integration, especially in form factors that are ideal for such, like laptops, tablets and phones.

So the option boils down to classical desktops, or being picky on which laptops to buy.

Maybe I won't care about upgradeability right now. The architecture is clearly in flux, the roles of traditional "CPU" and "GPU" are rapidly evolving. Maybe in 5 years, or even 3 years, a brand-new machine from 2026 won't be worth upgrading for a new role due to a seriously different architecture, but would only be relegated to do something "traditional".
I wish manufacturers could consider a hybrid approach. There should be no reason an architecture can't support both unified memory (effectively L4(?) cache), and cheaper, upgradeable system memory on sticks for old-school application use.
Upgradable memory and unified memory aren't entirely mutually exclusive. You can design a chip that uses DDR5 and has a decently-powerful iGPU that can use that whole memory pool. But you'll be starving that GPU of bandwidth relative to what you'd achieve with soldered LPDDR, and it's not really worth the trouble of building a large iGPU unless you're also going to feed it with the fastest memory you can reasonably put down.

If you look at eg. an Intel laptop chip, you'll see they design and build a memory PHY that can interface with either DDR5 or LPDDR5x. They don't support splitting it to have one controller operating with DDR5 and the other with LPDDR5x, for fairly obvious reasons: more complex hardware, harder for software/operating systems to manage optimally, and not a lot of benefits to drive demand and justify the expenses. The speed difference between LPDDR5x and DDR5 isn't really large enough to use LPDDR5x as an L4 cache; it would be more like two different NUMA nodes, with complications for laptop power management.

If you want somebody to build a chip with more than the usual 128-bit bus and make some of the memory controllers use LPDDR and some DDR5, then you're asking for a significant increase in chip cost due to the extra memory PHYs and pin count. That cost is only justified if almost all products using the bigger chips are going to actually take advantage of the full complement of memory controllers.

Are there no PCIe standards that are sufficient to support both use cases?

What happened to PCIe 8 and CXL?

AFAIK PCIe6 just started getting implemented in hardware last year... PCIe7 Spec was just released last year too...

PCIe6 is a much larger change than 'just bump up the transfer rate', the encoding changed too (on top of the new code length, it's no longer NRZ,) so everyone needed to design and validate both the new encoding block, negotiation, etc etc.

That said, I'm guessing PCIe7 will be a 'smoother' transition from PCIE6, i.e. we might see 7.0 products in 2027. That will theoretically get you ~240GB/sec, on an x16 link, or hypothetically a little less than the hypothetical max of a current Strix Halo. (I'm guessing however, that PCIe protocol overhead will make the difference larger.)
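
A rough sketch of where those link numbers come from. It computes raw per-direction figures only (lanes times per-lane transfer rate), before any FLIT/FEC or protocol overhead, which is why it comes out a bit above the ~240GB/sec quoted above. The function names are made up, and the Strix Halo configuration (256-bit LPDDR5X at 8000 MT/s) is an assumption not stated in the thread.

```python
def pcie_raw_gbs(gigatransfers_per_s: float, lanes: int = 16) -> float:
    """Raw one-direction PCIe bandwidth in GB/s, ignoring encoding and protocol overhead."""
    return gigatransfers_per_s * lanes / 8


def memory_peak_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return bus_width_bits / 8 * transfer_rate_mts / 1000


pcie6_x16 = pcie_raw_gbs(64)    # PCIe 6.0: 64 GT/s per lane
pcie7_x16 = pcie_raw_gbs(128)   # PCIe 7.0: 128 GT/s per lane

# Assumption: Strix Halo with a 256-bit LPDDR5X-8000 interface.
strix_halo = memory_peak_gbs(256, 8000)

print(f"PCIe 6.0 x16 raw: {pcie6_x16:.0f} GB/s per direction")
print(f"PCIe 7.0 x16 raw: {pcie7_x16:.0f} GB/s per direction")
print(f"Strix Halo peak:  {strix_halo:.0f} GB/s")
```

On raw numbers a PCIe 7.0 x16 link only just matches that memory interface, and usable throughput will be lower once overhead is counted.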

Don't really buy the economic argument. For 99% of all workloads you need at least an order of magnitude more system memory than gpu memory.

Most systems barely need more gpu memory than what is required for video, browsing etc.

Just because we found a new usecase doesn't flip that on its head.

Besides, I want to keep doing what I'm doing today. So if I need 128GB today and my local AI needs 128 GB then I'd need 256 GB to keep doing the same work.

The argument rather seems to be that we shouldn't use such expensive memory on the GPU. Which might be true if you only want to do inference on it.

Jensen Huang has publicly stated he wants a future where "AI" agents use more PC computers than people.

It is ambitious, and absurd... like all CEOs that eventually go loopy. =3

Unified memory is only a feature because NVidia so aggressively uses VRAM for market segmentation.

The 5090 ($2k MSRP but realistically $3-3.5k) is almost the same as the RTX 6000 Pro (~$10k). Same memory bandwidth (1800GB/s). Slightly different CUDA cores (21k vs 24k). Big difference? VRAM (32GB vs 96GB).

NVidia ultimately doesn't want to upset this segmentation so the RTX Spark will never undermine their other offerings. This is why I think Apple has a real market opportunity if they choose to embrace it.

To this day I do not get why Intel doesn't just offer massive memory options for their cards. Just charge what it costs to add the extra memory, no upcharge, and they will never be able to keep up with demand. Cheap VRAM is enough to justify a lot of open source investment into challenging CUDA.
> To this day I do not get why Intel doesn't just offer massive memory options for their cards.

They seem to? Intel Arc is the cheapest option by far for a discrete card with 32GB VRAM.

That’s not massive, though. Make it 96GB at $2,000 (ok, probably impossible right now, but they could have before the surge in prices) and you’ll see developers work really hard to make AI tooling work for their cards, CUDA be damned. The same goes for AMD.

It’s like they both want to rely on market segmentation for VRAM too but fail to realize that it’s their only potential inroad right now.

If you buy three 32GB GPUs, that's 96GB total at a very reasonable price. An AI model splits easily by layers, so running on multiple GPUs is quite feasible.
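
As a minimal sketch of what "splits easily by layers" means: assign contiguous blocks of transformer layers to each GPU so each card holds only its share of the weights. The function name and the 80-layer model are hypothetical. Real frameworks also have to place embeddings, the KV cache, and activations, and balance by memory rather than by layer count.

```python
def split_layers(num_layers: int, num_gpus: int) -> list[tuple[int, int]]:
    """Return a half-open (start, end) layer range for each GPU, as evenly as possible."""
    base, extra = divmod(num_layers, num_gpus)
    ranges = []
    start = 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer each.
        count = base + (1 if gpu < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


# Hypothetical 80-layer model spread over three 32GB cards.
for gpu, (start, end) in enumerate(split_layers(80, 3)):
    print(f"GPU {gpu}: layers {start}..{end - 1}")
```

Each token then passes through the GPUs in sequence, so only the activations at each boundary cross the interconnect, not the weights.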
Doesn't split as easily on an Intel GPU as on an NVIDIA GPU though, regarding software support. Sure, it's probably not too difficult if you know what you're doing, but not sure how big that market would be.
They took longer than everyone expected and then shortly after release they made announcements that made people worry that Intel might kill the project the way they tend to kill GPU projects.

(I still kinda want to get one tho.)

Missed a zero here.

Needs 320 GB Vram

Memory is just one part. AMD has had offerings competitive to NVIDIA for quite some time, but nobody uses AMD cards.

The biggest advantage with NVIDIA is CUDA.

> but nobody uses AMD cards

AMD is selling every MI card it makes, and the market wants more of them.

They are only selling because Nvidia is hard to get, and something is better than nothing.
I have so many questions… Since Apple already sells unified memory systems, what is the market opportunity you envision? Do you see Nvidia and Apple as competitors, and how? (And I’m not suggesting they’re not, necessarily, but I want to hear where you’re coming from, and they do have very different markets.) Hasn’t Apple used storage size (RAM & disk) for market segmentation for decades? And how does a machine with 128GB unified mem not potentially cut into some people’s reasons for wanting a 96GB GPU?
Apple offers relatively affordable options for a high-memory workstation that uses unified memory. They previously offered 256/512GB Mac Studios (both discontinued). Because of this they can keep larger models in memory.

BUT you just can't compete with NVidia performance for LLM workloads (mostly inference) for two reasons:

1. The memory bandwidth just can't compete with a 5090 (1800GB/s). The best current Mac is ~900GB/s. That directly caps tokens/sec and might be manageable but there's another problem; and

2. The raw FLOPS just can't compete with even a 5090. It probably needs to natively support FP4/FP8 to at least maintain a number format parity with NVidia. But beside that, NVidia just has more raw FLOPS.

According to Google, an M5 Max does ~70 FP16 TFLOPS while a 5090 does 380. If Apple can close that gap to at least be competitive and also hold larger models in shared VRAM, that would be a competitive advantage and it would directly attack NVidia's market segmentation.

The Mac Studio last came out in March last year. So we may get an update in Q3. Many are pinning their hopes on this. But it might not happen until next year. When it was released the M4 was the state of the art and it came with either the M4 Max or M3 Ultra (which, as I understand it, is basically 2 M3s stuck together, kind of). What people are hoping for is an M5 Ultra with >1000GB/s of memory bandwidth, ideally 200+ FP16 TFLOPS and hopefully FP4/FP8 support.

You can chain Mac Studios together into a cluster with TB5 too.

But it's reasonably likely that the next Mac Studio will be only incrementally better than the last generation.

I'm not the person you're replying to, but I wholeheartedly agree with them...

Quick background: doing AI inference requires three things. Lots of memory, lots of memory bandwidth, and of course plenty of compute that has access to that memory.

Quick reference: nVidia 5090 has 1,792 GB/sec bandwidth. 3090 gets about 1000 GB/sec. DGX Spark and AMD 395 whatever get about 275 GB/sec.

Apple M1 Max gets 400GB/sec, M5 Max gets 614GB/sec. Ultra variants get 2x that bandwidth, base variants get 1/2 that bandwidth. However... their compute is rather weak.

Right now, Apple's offerings are juuuuuust fast enough to run dense 27B models at usable speeds at like, 10% of the performance/watt of nVidia. They're world-leading general purpose CPUs but not killer GPUs.

By all accounts, these Windows PCs nVidia is touting seem to have DGX Spark like performance, which is less than impressive. Same with the upcoming AMD AI-oriented consumer stuff.

The other context here is that running your own AI at home is just starting to become feasible in terms of open model availability and the ability to run it at usable speeds. Many are interested in it for reasons of privacy, security, and cost certainty vs. buying tokens.

    Since Apple already sells unified memory systems, what 
    is the market opportunity you envision?
nVidia and AMD can't make their consumer offerings too good at AI, because that risks interfering with their higher-margin data center sales.

(And, let's face it. Even if nVidia did release a 6090 with 64-128GB of memory for an affordable price, consumers wouldn't get their hands on them anyway because people would just start filling data centers with them)

So.

Now you see Apple's opportunity, right? No data center sales to interfere with. No relationship with nVidia or AMD to worry about.

They could choose to make an absolute beast of a home AI machine. The M5 Ultra, if announced, might be that. It's admittedly a niche market, but people are already buying 64GB+ Macs faster than Apple can make them and they're fetching high prices on the used market as well.

The only real questions are if this market is even something Apple would find time to care about, and if they could secure enough DRAM to make a go at it. They are enormous obviously but they're feeling the RAM pinch just like everybody.

They use different technology for their VRAM though. Apple, AMD Strix and NVidia DGX/RTX Spark use LPDDR, whereas discrete cards will be either GDDR or HBM. That directly impacts the memory bandwidth figures. As for compute available, Apple and AMD still have very good figures there for what's essentially a general-purpose iGPU that ships as part of the stock system, rather than a special-purpose piece of dedicated hardware.
The M5 has 16 dedicated ‘Neural Engine’ cores and a ‘Neural accelerator’ in each of its conventional GPU cores. It’s been pretty special-purpose juiced for inference.
When it comes to the very largest models the ANE seems to be only marginally useful for prefill. The M5 Neural Accelerators (NAX) help a lot but at a real cost wrt. power and thermals.
There’s something else. Memory size.

Even if a Mac isn’t the fastest in raw numbers it may be faster if it can load the whole model in its ram (went up to 512 GB before shortages) than a couple 32 GB cards could with the data having to be constantly loaded over PCI-E. Because unified memory means the Apple GPUs can access all 512 GB at full speed.

My understanding is this is the advantage that’s pushing huge Mac Studio demand. Because it was the only way to give GPUs so much memory at price points anywhere near.

Yeah you can do way better once you’re in the 5 digits. But below that Apple had a specific advantage for some.

You're correct about some things but mostly wrong.

Yes, a Mac with 128GB+ will let you load some pretty big models.

However, you're still not going to be able to run them at usable speeds. Here are some M5 Max benchmarks on a Qwen 27B model w/ 290K context.... 12 tokens/sec output.

https://www.reddit.com/r/oMLX/comments/1swztoh/m5_max_128gb_...

And that's a 27B model. So yes, a M5 Max 128GB will let you load some pretty big models - can probably fit 120B in there with room left over for context. But the M5 Max still doesn't have the compute to make it practical, at least from an interactive usage standpoint - 120B dense model is going to be like an order of magnitude slower than 27B. You have to understand the computation going on here. LLMs are basically a huge many-to-many operation, and those operations themselves are pretty heavy.

So back to my previous post... you need three things. You need fast memory, you need a lot of it, and you need GPU compute with direct access to that fast memory. The M5 Max has like, 1.5 of the 3.

The M5 Ultra (if it ever exists) could kinda hit all 3, although actually getting your hands on one will be quite the lottery ticket.

   My understanding is this is the advantage that’s pushing huge Mac Studio demand.
This is true, but also, people who made this investment found that they're still not very usable for those HUGE models. Don't take my word for it though. Lots of benchmarks out there. r/localllama is pretty active too.
12 tok/s can absolutely be "usable output" depending on what you're doing. I agree though that the 27B dense model often feels slow due to an overall weakness of memory throughput on that particular platform. Most real-world 120B models though will be MoE-based with only a small fraction of active parameters, and these run quite well. Also, dense models can benefit from batching, which is at least marginally viable with Qwen if you stick to shorter contexts and smaller batches.
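A rough sketch of why MoE changes the picture: the whole model still has to fit in memory, but only the active parameters are read per generated token. This is a bandwidth-only estimate in Python that ignores compute, KV-cache traffic and routing overhead. The 614 GB/s and 128 GB figures are the M5 Max numbers quoted elsewhere in this thread; the 6B-active MoE and the 8-bit weights are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_b: float   # billions of parameters that must sit in memory
    active_b: float  # billions of parameters read per generated token


def fits(m: Model, memory_gb: float, bytes_per_param: float = 1.0) -> bool:
    """Capacity check: all weights must be resident."""
    return m.total_b * bytes_per_param <= memory_gb


def decode_ceiling_tok_s(m: Model, bandwidth_gb_s: float, bytes_per_param: float = 1.0) -> float:
    """Bandwidth-only upper bound: only active weights are streamed per token."""
    return bandwidth_gb_s / (m.active_b * bytes_per_param)


models = [
    Model("dense 27B", 27, 27),
    Model("dense 120B", 120, 120),
    Model("MoE 120B, 6B active (hypothetical)", 120, 6),
]

for m in models:
    print(f"{m.name}: fits={fits(m, 128)}, <= {decode_ceiling_tok_s(m, 614):.1f} tok/s")
```

Both 120B variants need the same capacity, but the MoE one streams a twentieth of the bytes per token in this example, so its bandwidth ceiling is 20x higher.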
Even low-VRAM cards are actually very useful for running the comparatively smaller dense layers in large local MoE models. This only requires transferring very small amounts of data across the PCIe bus (similar to pipeline parallelism) so it fits nicely around the existing bottlenecks on that hardware.
It's also ECC ram but to be fair - yes quite overpriced. The RTX Pro line are basically what the Titan line used to be but way way more expensive.
> 5090 ($2k MSRP but realistically $3-3.5k)

These days, more like >$4.1K (at least in the US).

What should Apple do, in your view, to "embrace" it?
Mx Extreme = 2 x Mx Ultra = more cores. (Opportunity: processor chiplets could be designed to integrate in higher quantities.)

Increase RDMA cross-bar linking from 4x to 8x = a lotta ports, a switch, or a stacking interface.

Regular RAM size/speed scaling: 512GB -> 1TB Mac Studios. Wider RAM and RDMA paths * clocks.

Given the low power envelope of today's Mac Studios, and bandwidth limits, lots of room to scale up, if Apple chooses. My fantasy: 2x cores, 2x RAM sizes, 2x RDMA devices, 2-4x RAM & RDMA bandwidth.

DRAM optimized for CPU usage looks very different from DRAM optimized for GPU usage. You are leaving a lot of performance on the table when you have a unified memory architecture. It makes sense in some situations, but it is not a silver bullet.
>[..] take full use of the PCIe bandwidth of the GPU or the bandwidth of its GDDR memory.

I'm honestly a little confused by what you mean here. Why would we want to maximize those things? Games are about consistent output under the frame deadline, not full saturation of the hardware.

Why would anyone try to saturate a 5090 with their game? The addressable market is tiny and you'd have to hope their full spec runs as well as or better than your test rig or they'll still not hit framerate.

You could do some sort of adaptive quality where you spend time incrementally improving fidelity until your frame budget is up. In practice I think that might be trickier than it sounds, but I feel like theoretically there's something there that could get you the best graphics your rig can handle without dropping frames. I've been considering doing something like this when I've been building a game/engine lately.
There's only so high you can go because the game assets have a maximum quality. Maybe you'll be able to max out the 5090 but what about the next flagship GPU?

You're also likely not going to maximize all of bandwidth, compute, etc. because one of them will likely be your bottleneck. And it might be different depending on the GPU, too.

Most games are strictly scaled on resolution due to how deferred pipelines run. This is exactly the slider to max or not max everything on a gpu for games. The more pixels the more memory and the more compute.
If you're rendering at native resolution, which many PC gamers do, going higher isn't significantly better because it just helps with antialiasing via supersampling. There's no point rendering so many more pixels just because you can, that's just a waste of electricity.
The more pixels, the more compute of fragments but not necessarily more memory. A fragment might hit the same texel as an adjacent fragment.

Certainly not more from main memory, and maybe not more from the vram either depending on how the pipeline goes.

It's not a linear slider.

Memory safety is orthogonal to side-channels, and hardware-enforced isolation (e.g. IOMMU) is more powerful than compiler-enforced isolation (but both are good!)
Oh no now I have to worry about shaders row hammering my OS ram /s
You really do have to worry about that!
Isn't this how the Xbox 360 got hacked? Not necessarily rowhammer but other methods.. IIRC some shader code in King Kong was able to affect CPU execution or something like that.
And here I am with 128GB Strix Halo longingly eyeing the Blackwell cards that spit tokens 10-20x the speed.

The question is the ultimate shape of knowledge compression and bandwidth optimization at which we arrive, I suppose.

If you haven't already, check/increase the GPU memory carve-out on your UEFI.

More details: https://rocm.docs.amd.com/en/docs-7.2.0/how-to/system-optimi...

Currently utilizing 126GB GTT on a headless host
that link actually recommends not doing it from UEFI and doing it via software
That was the main reason for the big hype around Memristors 15 years ago. High density, high speed persistent memory to completely remove the need for hdd/ssds, potentially even removing the need for external memory altogether. So frustrating that it still seems like we're a long ways from that becoming reality. There's some renewed interest in Memristors as they can simulate neural network connections in models, so maybe the funding will return for it.
The one example of persistent memory that managed to reach the mass market was Intel Optane/3dXPoint (still popular today among people looking to save on RAM costs) and that used a kind of phase-change memory, which is but tangentially related to memristors. ReRAM is somewhat closer, but it's also been less successful so far.
Optane was still much slower than Ram. And not that much faster than NVME (theoretically)
Well, back in the day... The Mac IIfx had video memory (dual-ported RAM) that could be read and written to out of different ports. Wicked fast. It took 486DX2s more than a year to catch up.
What is the difference between unified memory and shared memory?

Shared memory existed since the first CPU with an embedded GPU came to market and you could set in BIOS how much memory goes to what component.

I do have an opinion about how unified memory could be different, but I want a proper explanation.

I'm not sure everyone uses the terms consistently, but the difference is that the old "shared" memory was reserving a section to act as VRAM under the control of the GPU, ignored by the OS. The CPU ran the same kind of code pretending there is a "bus transfer" between host memory and graphics memory.

In unified memory, all the memory is host memory and data can go from program to GPU with zero copy movements. The addresses of buffers can be shared via appropriate MMU translation support, so that the application and graphics subsystem are communicating effectively through the basic RAM cache coherency protocols over the same buffers.

Edit to add: Aside from the zero copy transfer potential, it also means dynamic allocation strategies can shift the balance between host and graphics allocations on the fly. Individual image and message buffers can be allocated on the fly instead of setting a static split between the two worlds.
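As a loose analogy only (plain Python, no real GPU involved), the difference between copying into a reserved region and sharing one buffer can be sketched with `bytes` copies versus `memoryview`s. The "CPU" and "GPU" names here are just labels for two views; the function names are made up for illustration.

```python
def carve_out_style() -> bool:
    """Old 'shared' model: data is copied into the GPU's reserved region."""
    cpu_buf = bytearray(b"frame-01")
    gpu_buf = bytes(cpu_buf)          # explicit copy, like a host -> VRAM transfer
    cpu_buf[-1:] = b"2"               # CPU updates its buffer afterwards
    return gpu_buf == bytes(cpu_buf)  # the GPU-side copy is now stale


def unified_style() -> bool:
    """Unified model: both sides hold views onto the same underlying buffer."""
    pool = bytearray(8)               # stand-in for the single memory pool
    cpu_view = memoryview(pool)
    gpu_view = memoryview(pool)       # no copy is made here
    cpu_view[:] = b"frame-01"
    cpu_view[-1:] = b"2"
    return bytes(gpu_view) == b"frame-02"  # the other side sees the update


print("copy still in sync:", carve_out_style())
print("shared view in sync:", unified_style())
```

Real systems also need the cache coherency and MMU translation support described above; the sketch only shows the "same buffer, no transfer step" part.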

Reserved sounds like it would have been a better term now that I'm reading this many years later.
You got it in one! That's exactly what makes unified memory superior for current use cases, and different from the shared memory woes of old.
That's my understanding, or, maybe a better word would be "guess". The CPU telling the GPU: this is your memory now.
To some degree this is how it already feels like to program basically anything with dma today. You map hardware into an iommu and stop touching it when the hardware is supposed to use it, and then you reclaim it afterwards. So the model from the os feels the same, the difference is that it's not copying the memory into some local memory to operate on it.
Shared memory of the past meant reserving a part of the memory for the GPU, which could then not be used or accessed by the CPU. If the CPU wanted to access something, it had to copy it from the GPU's section of the memory to its own. Unified memory means both just fully share the same memory.
For these in specific, they appear basically transparently to the GPU. There's a lot of software/firmware stuff for this, but also a different hardware architecture - while the RAM is on the CPU die, the nvlink-c2c gives it extremely low latency and 600GB/s bandwidth between the GPU and CPU.
Marketing, mostly? But perhaps also more flexibility with how much memory the GPU can directly access without reserving it.
No. Let’s define terms, as others have pointed out they’re not perfect.

Unified memory is what Apple is doing, other phones do, and many low end built in GPUs have done in PCs for ages. There is only one physical memory pool. Both the CPU and GPU can access it at full speed.

This means no copying between pools of memory. No speed penalty accessing the CPU memory from GPU or vice versa. If the GPU only needs 2 GB to draw the desktop it only uses 2 GB of the pool. Or it can use 45 GB if it needs it and the CPU doesn’t. But all memory has to be the same speed, and that ain’t cheap given how fast GPUs like things. I don’t know if expandable memory is possible, and since they use the same bus, they compete for bandwidth. Seems theoretically easier to program for to me.

The opposite is what’s been common in graphics cards since the 2D era. CPU and GPU have their own memory and can talk over PCI/AGP/PCI-E. This is what I think they mean by shared memory, if it’s not what’s the point in touting unified?

In this model if the GPU uses 2 GB of its 12 GB total, the other 10 isn’t available to the OS at full speed and I’m not aware of any operating systems that would use it for programs/cache by default. If the GPU needs 45 GB… too bad. You have to page things in and out of GPU memory over the much slower system bus. Starting a game means loading assets into main memory then transferring them to the GPU (newer tech can accelerate this). But the CPU can have slower memory than the GPU, saving money. Memory expansion on the CPU side is easy. And the CPU saturating its memory bus has no effect on the speed of the GPU memory bus because it’s physically separate. More complicated memory model but it’s the one everyone is used to.

Which is better is a matter of opinion and workload needs.

Yes, I know there is an actual difference vs. dedicated GPUs with their own VRAM. I say it's marketing because Apple popularized the unified memory term even though, as you said, it existed in iGPUs long before Apple Silicon and was called shared GPU memory.

> I don’t know if expandable memory is possible

It technically is. These new systems (mostly) get their high bandwidth by using more channels (wider bus) of normal RAM modules. A system that has LPCAMM2 sockets should allow using the same LPDDR5X memory but you'd need a socket per two channels. A typical PC only supports two channels so having four (two sockets) would double the bandwidth.
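The "wider bus" point is simple arithmetic: peak bandwidth is bus width in bytes times transfers per second. A minimal Python sketch follows. It assumes a two-channel PC means a 128-bit bus in total, and the 8533 MT/s rate is an example LPDDR5X speed picked for illustration rather than a claim about any particular product.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, mega_transfers_per_s: float) -> float:
    """Theoretical peak: (bits / 8) bytes per transfer * transfers per second."""
    return bus_width_bits / 8 * mega_transfers_per_s / 1000


RATE = 8533  # MT/s, assumed example rate

for label, width in [("two channels", 128), ("four channels", 256)]:
    print(f"{label} ({width}-bit): {peak_bandwidth_gb_s(width, RATE):.1f} GB/s")
```

Same memory type and speed in both rows; only the width changes, and the peak figure doubles with it.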

Bandwidth by going wider, not faster. That makes sense.
System RAM has much lower bandwidth and less predictable access. Notably, the transfer from system to GPU is very slow. About 30x slower. LLMs aren’t designed to queue or parallelise operations to account for this. They just become much slower.
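For a sense of where a figure like "about 30x" can come from, here is a rough Python comparison of a PCIe x16 link against the 1,792 GB/s VRAM bandwidth quoted earlier in the thread for a 5090. It uses the standard per-lane rates and 128b/130b encoding for PCIe 3.0 and 5.0, and ignores protocol overhead and the system RAM's own limits, so treat it as an estimate only. The function and constant names are made up for illustration.

```python
def pcie_gb_s(gt_per_s: float, lanes: int = 16) -> float:
    """One-direction link bandwidth for PCIe 3.0+ (128b/130b line encoding)."""
    return gt_per_s * lanes * (128 / 130) / 8


GEN3_GT_S, GEN5_GT_S = 8.0, 32.0
VRAM_GB_S = 1792.0  # figure quoted in this thread, not verified

for name, rate in [("PCIe 3.0 x16", GEN3_GT_S), ("PCIe 5.0 x16", GEN5_GT_S)]:
    link = pcie_gb_s(rate)
    print(f"{name}: {link:.1f} GB/s, VRAM is ~{VRAM_GB_S / link:.0f}x faster")
```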
It’s also the reason why you will never be able to repair or upgrade your computer in the future. From a technological point of view these are indeed big advancements.

However, I couldn’t care less about faster CPU when:

1. It limits my ability to upgrade my system

2. Windows gets increasingly bloated and slower

LPCAMM2
The "one big drawback" is the lack of consumer upgrades, and the seemingly arbitrary prices charged by vendors for memory upgrades at time of system purchase. I'm not saying it has to be that way, but seems like it has been so far :-(
" Even local AI use cases don’t substantially or meaningfully benefit from faster memory, at least to average consumers."

What do you mean by this? Memory bandwidth is fundamental to the speed of a local AI model

While I'm a supporter of Rust, I have to point out that Rust's memory safety doesn't help against side-channel attacks.
Yeah, no. GDDR is functionally very different than SDRAM.

GDDR tries to push out as much bandwidth as possible, because that really matters for (traditional) GPU workloads. A constant but insignificant (= correctable) error rate is considered completely fine for GDDR, because that sacrifice allows the memory to be pushed much farther.

Meanwhile most (traditional) SDRAM workloads don't give a hoot about bandwidth but really care about latency. And ideally you want no errors, hence ECC RAM being so venerated.

If you unify memory, you're gonna have to choose to sacrifice one of those workloads or go suboptimal for both.

Weirdly enough this mostly matters for non-gaming workloads. The Apple M-series are absolute monsters in gaming, completely crushing the RTX XX90 editions in performance-per-watt, but as soon as memory bandwidth becomes paramount the M-series falls heavily behind.

> Even local AI use cases don’t substantially or meaningfully benefit from faster memory, at least to average consumers.

I'm not sure what you mean by this. Memory bandwidth is the main bottleneck for single-user decode. The bottleneck is actually more severe for end-user inference than cloud inference, because end users don't have the option to increase arithmetic intensity by computing tokens for multiple clients in the same pass.
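The arithmetic-intensity argument can be sketched as a tiny roofline model in Python. Per decode step the weights are read once regardless of batch size, while the compute grows with the number of sequences in the batch. This ignores KV-cache reads and assumes about 2 FLOPs per parameter per token; the device figures (1,792 GB/s, 380 TFLOPS) are the 5090 numbers quoted elsewhere in the thread, and the 27B model with 8-bit weights is an illustrative assumption.

```python
def decode_throughput_tok_s(batch: int, params_b: float, bytes_per_param: float,
                            bandwidth_gb_s: float, tflops: float) -> float:
    """Tokens/sec across the whole batch under a simple roofline model."""
    t_memory = params_b * 1e9 * bytes_per_param / (bandwidth_gb_s * 1e9)  # weights read once per step
    t_compute = 2 * params_b * 1e9 * batch / (tflops * 1e12)              # work scales with batch
    return batch / max(t_memory, t_compute)


for batch in (1, 8, 64, 256, 512):
    rate = decode_throughput_tok_s(batch, 27, 1.0, 1792.0, 380.0)
    print(f"batch {batch:>3}: ~{rate:,.0f} tok/s total")
```

In this model throughput scales almost linearly with batch size until compute becomes the limit, which is the option a single end user does not have.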

One thing we've learned from Apple is the viability of spamming more LPDDR5X channels (up to 1024-bit total bus width on M3U) as a means of achieving high bandwidth while keeping the cost/capacity reasonable.

If this thing only has as much gpu bandwidth as the spark, it’s kinda pointless
Not true. This is aimed squarely at the Strix Halo and Mac markets. It's basically just strictly better than the Strix, and it's not clear cut vs. the Macs in any sort of blanket statement.

My M5 Max 128gb MBP decodes faster than one of my Sparks, but the Spark's prefill is so much faster it can often answer the same query before the mac's prefill is finished. If you have large prompts, low cacheability, etc., a spark might be a very good option.

Not to mention you can get two sparks and the MBP will be 85%+ of the cost at half the RAM.

I'm kind of tempted to pick one up. Leave running big models to my dual dgx setup, and all the misc. random stuff on an rtx.

Prefill will be a huge deal if batched unattended inference of SOTA models (on consumer platforms) becomes viable, because at that point it's the main remaining bottleneck. If running 30 inferences together boosts your decode throughput to 3x (that's consistent with some very rough experiments, though these haven't even looked at trying to mask SSD offload latency just yet), that's a 10x in total decode time but a 30x in total prefill time, because prefill workloads are fully compute bound already on consumer platforms and don't benefit from batching much at all.
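The 10x versus 30x figures above follow from simple division. Here is a minimal Python sketch, assuming (as the comment does) that batching gives a 3x decode throughput boost and no prefill benefit. The function name is made up for illustration, and the results are expressed relative to one unbatched job.

```python
def batched_totals(n_jobs: int, decode_speedup: float, prefill_speedup: float = 1.0):
    """Total decode and prefill time for n_jobs, in units of one job's time."""
    total_decode = n_jobs / decode_speedup
    total_prefill = n_jobs / prefill_speedup
    return total_decode, total_prefill


decode_x, prefill_x = batched_totals(30, decode_speedup=3.0)
print(f"decode: {decode_x:.0f}x one job, prefill: {prefill_x:.0f}x one job")
```

With those inputs, decode for all 30 jobs costs as much as 10 single jobs while prefill costs the full 30, so prefill ends up as the larger share of the total.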
Fair, but I don’t see what case you have w this. Mind sharing?

Seems niche to be both uncacheable and long context?

Anything where you're dealing with a large volume of records/documents. Lots of people are using these for large-scale digitization of documents - scanned stuff being OCR'ed and summarized, generating embeddings, etc. Large scale translation.

Anywhere where you might have a large backlog of data to work with can end up in this sort of situation.

> The Unified Memory pool is what will continue to be the “game changer” in systems architecture, especially outside of data centers.

The ps4 was the prime example of this, and how it could run so many great games.

Isn't the big drawback not having a swappable GPU? Perhaps that's not as important anymore but I'm not sure we've confirmed the market demand for that.
yeah, you only see double digits in performance degradation from going from pcie 5 to 3 with a 5090 (at x16 speed), with everything else its like in the single digits area.
And the thing we gamers forget is that we’re the outlier. We’re the edge case.

Most consumers will never really care about, let alone see, the difference in PCIe or memory bandwidth impacts from such a shift to unified memory pools. We might (being, at least in my case, a huge nerd), but I’m increasingly of the opinion that if modern blockbuster games are built for upscaling/reconstruction anyhow, then suddenly such sacrifices to performance seem acceptable relative to the gains in efficiency.

Well I mean, the idea with games is it all fits in vram. You really don't want to be thrashing. It's that things are still so slow that they must be avoided entirely, no?

No copy unified memory will help with that but you do pay the read speed costs.

gen3 is 16 years old.
This kind of post shows you have little idea why cpu and gpu are not sharing memory in the first place.
> The Unified Memory pool is the “game changer”

M1 knocking from 2020.

Game changed, past tense, six years ago. This is catch-up.

Hell, SGI O2s from 1996 had this. For all of the hype the performance gains were pretty modest.
FWIW, the O2's UMA let it handle far more textures than almost any other contemporary system with reasonable performance.

Most other SGIs had single or low double-digit megabytes of texture memory, whereas the O2 could host one gigabyte of unified memory and use a huge chunk of that for textures.

UMA was never about performance and it still isn't. Spark is slower than a 5090.
did they learn why? were there other gains?
The O2 GPU was slower than other SGI options at the time, however it could use a hilariously larger pool of memory without copying, which meant that the O2 could use approaches that were punishingly hard (very tight transfer loops) or impossible (huge textures that couldn't be virtualized due to needing the whole texture).

That was because unlike other GPUs at the time, O2's didn't have dedicated memory but shared the memory with CPU - way slower, but zero copies and bigger.

Arguably early home computers and workstations also used "unified memory" :D

FYI it existed long before that. Shared memory between CPU and iGPU has been a thing for a long time.
Zero-copy shared memory?
yes, here is 2013 AMD presentation of the topic as example: https://events.csdn.net/AMD/GPUSat%20-%20hUMA_june-public.pd... see slide 14 especially
Ah. Well, what kind of consumer hardware/software combo could I purchase to use this? outside of perhaps the... PS4?
Everything that doesn't have a discrete GPU has unified memory these days. If you're asking for something closer to the RTX Spark or Apple Silicon then look at AMD's Strix Halo systems.
Every AMD APU since introduction of HSA did it, which is how AMD ended up doing SoCs for PS4, PS5, and Xbox
I want unified but not uniform - everything can address anything, but you can add slower RAM to the system without requiring an entirely new chip. NUMA is cool.
AMD Fusion knocking from 2010.
Intel was doing UMA with their i740 graphics in the late 90s. Codename TIMNA was cancelled, but they pioneered it and used it on their gpu/cpu chips as well as their breakthrough 810 chipset that dominated the graphics market for a decade. It was despised because it was ubiquitous and a low performing graphics engine but games had to accommodate it.

Funny that it is getting credit only now.

SGI O2 was the famous "unified memory architecture" graphics system, two years before i740 that didn't really do UMA.

O2 was popular in systems where large textures or textures generated dynamically (like mapping external video input to texture) were important

> (which is good for Rust adherents, I figure).

As a Rust adherent, please do not put words in our mouths or set up unrealistic expectations for other people by linking together concepts at a very shallow level.

Language level memory safety has no answer for hardware security flaws which is what side channel attacks are. No programming language can provide memory privacy if another chip in your machine can read your memory. Just like no programming language can protect your application from a kernel vulnerability of the kernel it’s running on.

Damn. That wasn’t my intention at all, I was just pointing out that Rust has another reason to see wider adoption vis a vis the usual Valley advertising bullshit of deliberately conflating hardware security with software security. I personally give no fucks what something is written in, only that it’s written well enough that I don’t have to twist arms or babysit yet another sloppy piece of code in my enterprise.
But... it's rust.