


	by nik736 248 days ago
	This is only the base model; there are no upgrades yet for the Pro/Max versions. The memory bandwidth is 153 GB/s, which is not enough to run viable open-source LLMs properly.

9 comments

wizee 248 days ago

153 GB/s is not bad at all for a base model; the Nvidia DGX Spark has only 273 GB/s memory bandwidth despite being billed as a desktop "AI supercomputer".

Models like Qwen 3 30B-A3B and GPT-OSS 20B, both quite decent, should be able to run at 30+ tokens/sec at typical (4-bit) quantizations.
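
A rough sketch of why 30+ tokens/sec looks plausible, assuming decode speed is limited only by memory bandwidth and that a mixture-of-experts model reads just its active parameters for each token. The function names are made up for illustration, the parameter counts are rounded, and real throughput will land well below this ceiling.

```python
# Back-of-the-envelope decode ceiling for a mixture-of-experts (MoE) model.
# Assumption: only the active parameters are read from memory per token,
# and nothing else (KV cache, activations, compute) slows things down.

def moe_tokens_per_sec_ceiling(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/sec if memory bandwidth is the only limit."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


def weights_footprint_gb(total_params_billion, bits_per_weight):
    """Memory needed to hold all weights; every expert has to stay resident."""
    return total_params_billion * bits_per_weight / 8


# Qwen 3 30B-A3B: roughly 30B total parameters, roughly 3B active per token.
ceiling = moe_tokens_per_sec_ceiling(3, 4, 153)
footprint = weights_footprint_gb(30, 4)
print(f"ceiling: {ceiling:.0f} tokens/sec, weights: {footprint:.0f} GB")
```

The bandwidth cost follows the active parameters, while the memory capacity needed follows the total parameters.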


zamadatix 248 days ago

Even at 1.8x the base memory bandwidth and 4x the memory capacity, Nvidia spent a lot of time talking about how you can pair two DGXs together over the 200G NIC to slowly run quantized versions of the models everyone was actually interested in.

Neither product actually qualifies for the task IMO, and that doesn't change just because two companies advertised them as such instead of just one. The absolute highest end Apple Silicon variants tend to be a bit more reasonable, but the price advantage goes out the window too.


cma 248 days ago

M5 says 3x Thunderbolt 5, which should be able to do 240G bidirectional in total. Not that useful yet with a max of 32GB of RAM, though.


mrheosuper 247 days ago

My M1 Pro has over 200 GB/s of RAM bandwidth. Five years later, it's reasonable to expect the base chip to reach that speed.


replete 248 days ago

Looks like the base M5 has LPDDR5X-9600, which works out to 153.6 GB/s, up from the base M4's 120 GB/s with LPDDR5X-7500. The Pro/Max versions have more memory controllers, with 16, 24 and 32 channels. The 32-channel top-end M5 will have 614 GB/s by my calculations.

It would take 48 channels of LPDDR5X-9600 to match a 3090's memory bandwidth, so the situation is unlikely to change for a couple of years, until DDR6 arrives, I guess.
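
The calculation can be sketched like this, assuming each LPDDR5X channel is 16 bits wide and the base chips use 8 channels (neither number is stated above; both are assumptions that happen to reproduce the 120 and 153.6 GB/s figures). These are theoretical peaks only.

```python
# Theoretical peak bandwidth = channels * channel width in bytes * transfer rate.
# Assumption: 16-bit LPDDR5X channels, 8 of them on the base chips.

CHANNEL_WIDTH_BITS = 16  # assumed channel width


def peak_bandwidth_gb_s(channels, mega_transfers_per_sec):
    """Peak bandwidth in GB/s for a given channel count and MT/s rating."""
    bytes_per_transfer = channels * CHANNEL_WIDTH_BITS / 8
    return bytes_per_transfer * mega_transfers_per_sec / 1000  # MB/s -> GB/s


configs = [
    ("base, 7500 MT/s", 8, 7500),
    ("base, 9600 MT/s", 8, 9600),
    ("16 channels, 9600 MT/s", 16, 9600),
    ("24 channels, 9600 MT/s", 24, 9600),
    ("32 channels, 9600 MT/s", 32, 9600),
    ("48 channels, 9600 MT/s", 48, 9600),
]

for label, channels, rate in configs:
    print(f"{label}: {peak_bandwidth_gb_s(channels, rate):.1f} GB/s")
```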


mpeg 248 days ago

The memory capacity to me is an even bigger problem, at 32GB max.


sgt 248 days ago

That'll come in the MacBook Pro etc cycle, like last time, then you'll have 512GB RAM


bombcar 248 days ago

Is the M4 Ultra even out yet? I can't see anything with 512 GB but the M3 Ultra on the Mac Studio (for a cool $4000 more).


asimovDev 248 days ago

I am interested in seeing if they skip M4 and go straight to M5 and only make that available in the Pro. From my unscientific observations it seems that chips are running hotter and hotter; I wouldn't be surprised if an M5 Ultra struggled in a Studio and required the cooling performance of the Mac Pro case.


mpeg 248 days ago

Same with bandwidth though, usually pro/max memory has much higher speed


andy_ppp 248 days ago

Yes the M4 Base has 120 GB/s, Pro 273 GB/s and Max has 546 GB/s... That means M5 Pro is potentially around 348 GB/s and M5 Max is almost at 700 GB/s - for comparison a 4090 has around 1,000 GB/s. So pretty incredible!
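
Those projections appear to come from scaling each M4 tier by the base chip's generational ratio. A minimal sketch of that arithmetic, assuming the Pro and Max improve by the same factor as the base chip (a guess, not something Apple has confirmed here):

```python
# Scale the M4 bandwidth figures by the base-chip ratio (153 / 120).
# Assumption: every tier improves by the same factor as the base chip.

M4_BANDWIDTH_GB_S = {"base": 120, "Pro": 273, "Max": 546}
M5_BASE_GB_S = 153


def project_m5(m4_bandwidth, m5_base=M5_BASE_GB_S):
    """Return projected bandwidth per tier, keyed like the input dict."""
    ratio = m5_base / m4_bandwidth["base"]
    return {tier: gb_s * ratio for tier, gb_s in m4_bandwidth.items()}


for tier, gb_s in project_m5(M4_BANDWIDTH_GB_S).items():
    print(f"M5 {tier}: ~{gb_s:.0f} GB/s")
```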


sgt 248 days ago

Also I think even an M3 Ultra is more cost effective at running LLMs than a 4090 or 5090, mostly due to being more energy efficient. It's also less fragile than running a gamer PC build.


andy_ppp 248 days ago

It can run larger models quite slowly but lacks matmul acceleration (included in the M5) that is very useful for context and prompt performance at inference time. I will probably burn my budget on an M5 Max with 256GB (maybe even 512GB) of memory; the price will be upsetting but I guess that is life!


replete 248 days ago

I think the M5 Max will be more like 614 GB/s, unless they have somehow exceeded LPDDR5X-9600 or added more than 32 memory controllers.


andy_ppp 247 days ago

DDR5-9600 is 153GB/s from a single channel, Max has 4 channels… these are all theoretical values of course - in the real world none of these, not even the graphics card, will get that near to them… so not sure what you’re saying.


iyn 248 days ago

Yeah, that's my main bottleneck too. Constantly at 90%+ RAM utilization with my 64GiB (VMs, IDEs etc.). Hoping to go with at least 128GiB (or more) once M5 Max is released.


czbond 248 days ago

I am interested to learn why models move so much data per second. Where could I learn more that is not a ChatGPT session?


Sohcahtoa82 248 days ago

Models are made of "parameters" which are really weights in a large neural network. For each token generated, each parameter needs to take its turn inside the CPU/GPU to be calculated.

So if you have a 7B parameter model with 16-bit quantization, that means you'll have 14 GB/s of data coming in. If you only have 153 GB/sec of memory bandwidth, that means you'll cap out at ~11 tokens/sec, regardless of how much processing power you have.

You can of course quantize to 8-bit or even 4-bit, or use a smaller model, but doing so makes your model dumber. There's a trade-off between performance and capability.
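
A rough sketch of that arithmetic, treating the weights as the amount of data read per generated token and ignoring everything else (KV cache, activations, compute). The function name is just for illustration.

```python
# Bandwidth-bound ceiling for a dense model: every weight is read once per token.
# Assumption: memory bandwidth is the only bottleneck.

def max_tokens_per_sec(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/sec when all weights are streamed for each token."""
    gb_per_token = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


# 7B parameters on 153 GB/s at three quantization levels.
for bits in (16, 8, 4):
    rate = max_tokens_per_sec(7, bits, 153)
    print(f"7B at {bits}-bit: ~{rate:.0f} tokens/sec ceiling")
```

Halving the bits per weight doubles the ceiling, which is the performance side of the trade-off.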


adastra22 248 days ago

I think you mean GB/token


Sohcahtoa82 248 days ago

Err...yup. My bad. Can't edit it now.


shorts_theory 248 days ago

You might be interested in the LLM Systems course, which covers how LLMs work at the hardware level and what optimizations can improve their efficiency: https://llmsystem.github.io/llmsystem2025spring/


modeless 248 days ago

The models (weights and activations and caches) can fill all the memory you have and more, and to a first (very rough) approximation every byte needs to be accessed for each token generated. You can see how that would add up.

I highly recommend Andrej Karpathy's videos if you want to learn details.


pfortuny 248 days ago

A very simplified version is: you need the whole matrix to compute a matrix x vector operation, even if the vector is mostly zeroes. Edit: obviously my simplification is wrong, but if you add in compression, etc… you get an idea.


rs186 248 days ago

Would you mind specifying which video(s)? He has quite a lot of content to consume.


hu3 248 days ago

Enough or not, they do describe it like this in an image caption:

"M5 is Apple’s next-generation system on a chip built for AI, resulting in a faster, more efficient, and more capable chip for the 14-inch MacBook Pro, iPad Pro, and Apple Vision Pro."


diabllicseagull 248 days ago

You don’t want to be bandwidth-bound, sure. But it all depends on how much compute power you have to begin with. 153 GB/s is probably not enough bandwidth for an RTX 5090. But for the entry laptop/tablet chip M5? It’s likely plenty.


chedabob 248 days ago

My guess would be those are going into the rumoured OLED models coming out next year.


Tepix 248 days ago

With MoE LLMs like Qwen 3 30B-A3B that's no longer true.


quest88 248 days ago

What do you mean by properly? What’s the behavior one would observe if they did run an LLM?


burnte 248 days ago

"Properly" means at some arbitrary speed that the writer would describe as "fast" or "fast enough". If you have a lower demand for speed they'll run fine.


nik736 248 days ago

If you have enough memory to load a model, but not enough bandwidth to handle it, you will get a very low token/s output.


Rohansi 248 days ago

You can also have enough bandwidth but be compute limited and get lower performance than expected. This is more likely to be the case for Apple Silicon vs. high power GPUs.
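
One way to picture the two limits is a roofline-style estimate: a forward pass takes as long as the slower of reading the weights and doing the arithmetic. This is a simplified sketch, assuming about 2 FLOPs per weight per token; the 10 TFLOPS figure is a made-up placeholder, not a spec for any real chip.

```python
# Roofline-style estimate: one forward pass is limited by whichever is slower,
# streaming the weights once or doing the math for every token in the batch.
# The compute figure used below (10 TFLOPS) is a hypothetical placeholder.

def step_time(params_billion, bits_per_weight, bandwidth_gb_s, compute_tflops, batch_tokens):
    """Return (seconds, limiting_factor) for one pass over batch_tokens tokens."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    flops = 2 * params_billion * 1e9 * batch_tokens  # ~2 FLOPs per weight per token
    memory_s = weight_bytes / (bandwidth_gb_s * 1e9)
    compute_s = flops / (compute_tflops * 1e12)
    if memory_s >= compute_s:
        return memory_s, "memory"
    return compute_s, "compute"


# Generating one token at a time vs. processing a 512-token prompt in one pass.
for tokens in (1, 512):
    seconds, limit = step_time(7, 4, 153, 10, tokens)
    print(f"{tokens:>3} tokens: {seconds:.3f} s, limited by {limit}")
```

Under these assumptions, single-token generation is bandwidth-limited while long-prompt processing is compute-limited.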
