by whimsicalism 478 days ago
Transformers are typically memory-bandwidth bound during decoding. This chip is going to have much worse memory bandwidth than the NVIDIA chips.

My guess is that these chips could be compute-bound, though, given how little compute capacity they have.
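
To make the memory-bound vs compute-bound question concrete, here is a rough roofline-style sketch in Python. It is not a benchmark: it assumes every weight is read once per decode step, uses the common approximation of about 2 FLOPs per parameter per token, and ignores KV-cache traffic and kernel overheads. The function name, the 70B-parameter 8-bit model, and the 30 TFLOP/s peak are illustrative assumptions; 800GB/s is roughly the Apple figure quoted elsewhere in this thread.

```python
def decode_time_per_token(n_params, bytes_per_param, mem_bw_bytes_s,
                          peak_flops, batch_size=1):
    """Roofline-style lower bound on seconds per decode step.

    Returns (seconds, "memory" or "compute") depending on which
    term dominates. All inputs are caller-supplied estimates.
    """
    # Time to stream all weights from memory once per step.
    t_mem = n_params * bytes_per_param / mem_bw_bytes_s
    # Time for the matmuls: ~2 FLOPs per parameter per token in the batch.
    t_compute = 2 * n_params * batch_size / peak_flops
    bound = "memory" if t_mem >= t_compute else "compute"
    return max(t_mem, t_compute), bound


if __name__ == "__main__":
    # Hypothetical device: 800 GB/s bandwidth, 30 TFLOP/s peak (assumed).
    for batch in (1, 64):
        t, bound = decode_time_per_token(
            n_params=70e9, bytes_per_param=1,
            mem_bw_bytes_s=800e9, peak_flops=30e12,
            batch_size=batch,
        )
        print(f"batch={batch}: {t * 1000:.1f} ms/step, {bound}-bound")
```

With these made-up numbers a single stream is bandwidth-bound, while a batch of 64 tips over to compute-bound, so which limit you hit first depends heavily on batch size and on the real peak FLOPS of the chip.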

3 comments

It's pretty close. A 3090 or 4090 has about 1TB/s of memory bandwidth, while the top Apple chips have a bit over 800GB/s. Where you'll see a big difference is in prompt processing. Without the compute power of a pile of GPUs, chewing through long prompts, code, documents, etc. is going to be slower.
Nobody in industry is using a 4090; they are using H100s, which have 3TB/s. Apple also doesn't have any equivalent to NVLink.

I agree that compute is likely to become the bottleneck for these new Apple chips, given they only have something like 0.1% of the FLOPS.

I chose the 3090/4090 because it seems to me that this machine could be a replacement for a workstation or a homelab rig at a similar price point, but not a $100-250k server in a datacenter. It's not really surprising or interesting that the datacenter GPUs are superior.

FWIW I went the route of "bunch of GPUs in a desktop case" because I felt having the compute oomph was worth it.

4.8TB/s on H200, 8TB/s on B200, pretty insane.
That’s wild, somehow I hadn’t seen the B200 specs before now. I wish I could have even a fraction of that!
VRAM capacity is the initial gatekeeper, then bandwidth becomes the limiting factor.
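
A minimal sketch of that two-stage check, in Python: first ask whether the weights fit at all, then use bandwidth to get an upper bound on single-stream decode speed. The function name, the 40GB model size, and the 128GB unified-memory capacity are hypothetical; the 1TB/s and 800GB/s bandwidths are the rough figures quoted earlier in the thread, and 24GB is the VRAM of a 3090/4090. It ignores KV cache, activations, and any offloading.

```python
def max_decode_tokens_per_s(model_bytes, mem_capacity_bytes, mem_bw_bytes_s):
    """Upper bound on single-stream decode tokens/s, or None if it won't fit.

    Stage 1 (capacity): the weights must fit in device memory.
    Stage 2 (bandwidth): each token requires reading all weights once.
    """
    if model_bytes > mem_capacity_bytes:
        return None  # capacity is the gatekeeper: model does not fit
    return mem_bw_bytes_s / model_bytes


if __name__ == "__main__":
    model_bytes = 40e9  # hypothetical 40 GB of weights
    devices = {
        "24 GB card at ~1 TB/s": (24e9, 1e12),
        "128 GB unified memory at ~800 GB/s (assumed)": (128e9, 800e9),
    }
    for name, (capacity, bandwidth) in devices.items():
        rate = max_decode_tokens_per_s(model_bytes, capacity, bandwidth)
        if rate is None:
            print(f"{name}: does not fit")
        else:
            print(f"{name}: at most ~{rate:.0f} tokens/s")
```

Under these assumptions the faster card never gets to use its bandwidth because the model does not fit, which is the "gatekeeper" point above.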
I suspect that compute might actually be the limiter for these chips before bandwidth, but I'm not certain.
> Transformers are typically memory-bandwidth bound during decoding.

Not in the case of language models, which are typically bound by memory size rather than bandwidth.

nope
I assume even this one won't run on an RTX 5090 due to constrained memory size: https://news.ycombinator.com/item?id=43270843
Sure, on consumer GPUs, but that is not what constrains model inference in most actual industry setups. Technically, even then, you are bound by CPU-GPU memory bandwidth more than by GPU memory size alone, although that is maybe splitting hairs.
Why are industry setups considered actual while others are not?