by gatienboquet 478 days ago

LLMs are primarily "memory-bound" rather than "compute-bound" during normal use.

The model weights (billions of parameters) must be loaded into memory before you can use them.

Think of it like this: Even with a very fast chef (powerful CPU/GPU), if your kitchen counter (VRAM) is too small to lay out all the ingredients, cooking becomes inefficient or impossible.

Processing power still matters for speed once everything fits in memory, but it's secondary to having enough VRAM in the first place.
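A rough way to check the "does it fit" question is to multiply the parameter count by the bytes stored per parameter. The sketch below does only that; the 70B model and the capacity figures are hypothetical, and it ignores the KV cache, activations, and runtime overhead, which all add to the real footprint.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_in_memory(num_params: float, bytes_per_param: float, capacity_gb: float) -> bool:
    """True if the weights alone fit in the given memory capacity."""
    return weight_memory_gb(num_params, bytes_per_param) <= capacity_gb


# Hypothetical 70B-parameter model (an assumption, not a model from the thread).
fp16_gb = weight_memory_gb(70e9, 2.0)  # 16-bit weights: 140.0 GB
q4_gb = weight_memory_gb(70e9, 0.5)    # 4-bit weights: 35.0 GB

print(f"fp16: {fp16_gb:.0f} GB, 4-bit: {q4_gb:.0f} GB")
print("fp16 fits in 24 GB:", fits_in_memory(70e9, 2.0, 24))
print("4-bit fits in 48 GB:", fits_in_memory(70e9, 0.5, 48))
```

Quantizing to fewer bits per parameter is the usual lever here: the same model drops from 140 GB to 35 GB of weights in this example.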

1 comments

whimsicalism 478 days ago

Transformers are typically memory-bandwidth-bound during decoding. This chip is going to have much worse memory bandwidth than the NVIDIA chips.

My guess is that these chips could be compute-bound, though, given how little compute capacity they have.

Gracana 478 days ago

It's pretty close. A 3090 or 4090 has about 1TB/s of memory bandwidth, while the top Apple chips have a bit over 800GB/s. Where you'll see a big difference is in prompt processing. Without the compute power of a pile of GPUs, chewing through long prompts, code, documents, etc. is going to be slower.
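To put the bandwidth comparison in numbers: if decoding is bandwidth-bound and every weight is read once per generated token, then tokens per second cannot exceed bandwidth divided by the size of the weights. The sketch below uses the two bandwidth figures quoted above; the 35 GB model size is a hypothetical, and the result is an upper bound that ignores KV cache traffic and any compute limits.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 35.0  # hypothetical model size (assumption)

gpu_bound = max_decode_tokens_per_s(1000.0, WEIGHTS_GB)   # ~1 TB/s, about 28.6 tok/s
apple_bound = max_decode_tokens_per_s(800.0, WEIGHTS_GB)  # ~800 GB/s, about 22.9 tok/s

print(f"~1 TB/s bound:   {gpu_bound:.1f} tok/s")
print(f"~800 GB/s bound: {apple_bound:.1f} tok/s")
print(f"ratio: {apple_bound / gpu_bound:.2f}")
```

Under these assumptions the decode ceiling differs by only about 20%, which is the "pretty close" part. Prompt processing is a different story because it is limited by compute rather than by weight reads.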

whimsicalism 478 days ago

Nobody in industry is using a 4090; they are using H100s, which have about 3TB/s. Apple also doesn’t have any equivalent to NVLink.

I agree that compute is likely to become the bottleneck for these new Apple chips, given that they only have something like 0.1% of the FLOPS.

Gracana 478 days ago

I chose the 3090/4090 because it seems to me that this machine could be a replacement for a workstation or a homelab rig at a similar price point, but not a $100-250k server in a datacenter. It's not really surprising or interesting that the datacenter GPUs are superior.

FWIW I went the route of "bunch of GPUs in a desktop case" because I felt having the compute oomph was worth it.

_zoltan_ 478 days ago

4.8TB/s on H200, 8TB/s on B200, pretty insane.

Gracana 478 days ago

That’s wild, somehow I hadn’t seen the B200 specs before now. I wish I could have even a fraction of that!

gatienboquet 478 days ago

VRAM capacity is the initial gatekeeper, then bandwidth becomes the limiting factor.

whimsicalism 478 days ago

I suspect that compute might actually be the limiter for these chips before bandwidth, but I'm not certain.
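One way to test a suspicion like this is a roofline-style comparison: a workload is compute-bound when its arithmetic intensity (FLOPs per byte read) exceeds the chip's balance point (peak FLOP/s divided by bytes/s of bandwidth). The sketch below assumes roughly 2 FLOPs per parameter per token and one read of each weight per forward pass. The chip numbers are placeholders, not the specs of any real part.

```python
def arithmetic_intensity(tokens_in_flight: int, bytes_per_param: float) -> float:
    """FLOPs per byte of weights read: ~2 FLOPs per parameter per token,
    with each weight read once and reused across all tokens in the pass."""
    return 2.0 * tokens_in_flight / bytes_per_param


def machine_balance(peak_flops: float, bandwidth_bytes_s: float) -> float:
    """FLOPs the chip can perform per byte it can read from memory."""
    return peak_flops / bandwidth_bytes_s


def bottleneck(tokens_in_flight: int, bytes_per_param: float,
               peak_flops: float, bandwidth_bytes_s: float) -> str:
    intensity = arithmetic_intensity(tokens_in_flight, bytes_per_param)
    balance = machine_balance(peak_flops, bandwidth_bytes_s)
    return "compute" if intensity > balance else "memory bandwidth"


# Placeholder chip (assumption): 30 TFLOP/s peak, 800 GB/s bandwidth.
PEAK_FLOPS = 30e12
BANDWIDTH = 800e9

# Decoding one token at a time with 16-bit weights.
print("decode, batch 1:", bottleneck(1, 2.0, PEAK_FLOPS, BANDWIDTH))
# Processing a 2048-token prompt in a single pass.
print("prompt, 2048 tokens:", bottleneck(2048, 2.0, PEAK_FLOPS, BANDWIDTH))
```

With these placeholder numbers, single-stream decoding lands on the bandwidth side and long-prompt processing lands on the compute side. A chip with much less compute for the same bandwidth has a lower balance point, so it tips into compute-bound territory at smaller batch sizes.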

cubefox 478 days ago

> Transformers are typically memory-bandwidth bound during decoding.

Not in the case of language models, which are typically bound by memory size rather than bandwidth.

whimsicalism 478 days ago

nope

cubefox 478 days ago

I assume even this one won't run on an RTX 5090 due to constrained memory size: https://news.ycombinator.com/item?id=43270843

whimsicalism 478 days ago

Sure, on consumer GPUs, but that is not what constrains model inference in most actual industry setups. Technically, even then, you are bound by CPU-GPU memory bandwidth more than by GPU memory alone, although that is maybe splitting hairs.
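The CPU-GPU point can be sketched as follows, assuming one common offloading scheme in which weights that do not fit in VRAM stay in host RAM and are streamed over the host link for every generated token. All sizes and bandwidths below are hypothetical placeholders.

```python
def per_token_seconds(weights_gb: float, vram_gb: float,
                      gpu_bw_gb_s: float, host_link_gb_s: float) -> float:
    """Lower bound on time per token when spilled weights are re-read
    over the CPU-GPU link each token (an assumed offloading scheme)."""
    on_gpu_gb = min(weights_gb, vram_gb)
    spilled_gb = max(weights_gb - vram_gb, 0.0)
    return on_gpu_gb / gpu_bw_gb_s + spilled_gb / host_link_gb_s


# Placeholder setup (assumptions): 32 GB of VRAM at 1000 GB/s,
# and a 32 GB/s link between host RAM and the GPU.
fits = per_token_seconds(20.0, 32.0, 1000.0, 32.0)    # everything in VRAM
spills = per_token_seconds(40.0, 32.0, 1000.0, 32.0)  # 8 GB left in host RAM

print(f"fits in VRAM:  {fits * 1000:.0f} ms/token")
print(f"8 GB spilled:  {spills * 1000:.0f} ms/token")
```

In this example the 8 GB that spill over account for 250 ms of the 282 ms per token, so once a model stops fitting, the host link bandwidth sets the speed rather than the VRAM capacity as such.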

cubefox 478 days ago

Why are industry setups considered actual while others are not?
