by menaerus 498 days ago
How much memory bandwidth do we actually need per generated token? Let's take one open-source model as a starting point, since not all models are created the same.
2 comments

For non-MoE (dense) models, the entire model has to flow through the CPU for every token. So if it is a 32B-parameter model quantised to 8 bits per parameter, that is 32 GB of memory traffic per token. If your RAM does 64 GB/s, that caps you at about 2 tok/s.
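The arithmetic above as a rough sketch. The function name is made up for illustration, and it assumes decoding is purely bandwidth-bound, with every weight read exactly once per token and nothing served from cache:

```python
def max_tokens_per_second(n_params, bits_per_param, bandwidth_gb_per_s):
    """Upper bound on decode speed if each token streams all weights from RAM once."""
    bytes_per_token = n_params * bits_per_param / 8  # weight bytes read per token
    bandwidth_bytes_per_s = bandwidth_gb_per_s * 1e9  # decimal GB, as in the comment
    return bandwidth_bytes_per_s / bytes_per_token


# 32B parameters at 8 bits each, on RAM that sustains 64 GB/s
print(max_tokens_per_second(32e9, 8, 64))
```

Real numbers will come in lower, since this ignores KV cache reads, activations, and the gap between peak and sustained bandwidth.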
I didn't get the impression that the math is that simple. The first obvious reason I can think of is the attention mechanism being used. Both GQA and MQA share key/value heads across query heads, so they keep a smaller KV cache and need less bandwidth per token than MHA.
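To put a rough number on the attention point, here is a sketch of how much KV cache has to be read per generated token under MHA, GQA and MQA. The layer count, head count, head size and context length are invented for illustration and do not describe any particular model; fp16 cache values (2 bytes) are also an assumption:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Size of the KV cache that attention reads when generating one token."""
    # Factor 2: one key and one value vector per layer, per KV head, per cached position.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical model shape, chosen only to make the comparison concrete.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, CONTEXT = 64, 40, 128, 4096

mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, CONTEXT)  # one KV head per query head
gqa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, CONTEXT)              # 8 shared KV heads
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, CONTEXT)              # a single shared KV head

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 1e9:.2f} GB of KV cache read per token")
```

The weights still have to be read either way, so this traffic comes on top of the weight traffic, and it grows with context length.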
How big are the active weights? That's how much data you need to read per token; multiply it by your target tokens per second to get the bandwidth you need.
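Turning that around as a sketch: given the active weights and a target speed, what bandwidth is required? The 12B-active, 4-bit figures are hypothetical, and the helper name is just for illustration:

```python
def required_bandwidth_gb_per_s(active_params, bits_per_param, target_tok_per_s):
    """Bandwidth needed to stream the active weights once per token at the target speed."""
    bytes_per_token = active_params * bits_per_param / 8  # only the weights actually used
    return bytes_per_token * target_tok_per_s / 1e9       # decimal GB/s


# Hypothetical MoE: 12B parameters active per token, 4-bit quantisation, 10 tok/s target
print(required_bandwidth_gb_per_s(12e9, 4, 10))
```

For a dense model the active weights are the whole model; for MoE only the routed experts plus the shared layers count, which is why the total parameter count alone overstates the requirement.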