How much bandwidth do we actually need per generated token? Let's take one open-source model as a starting point, since not all models are created equal.
For non-MoE (dense) models, the entire set of weights has to be streamed from RAM through the CPU for every token. So a 32B-parameter model quantised to 8 bits per parameter needs 32 GB of memory traffic per token. If your RAM does 64 GB/s, that is at most 2 tok/s.
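To make that arithmetic easy to replay with other numbers, here is a minimal sketch. The function name and parameters are made up for illustration, and it assumes a dense model whose weights are all read exactly once per token, ignoring the KV cache, caching effects and compute limits, so the result is an upper bound rather than a prediction.

```python
def max_tokens_per_second(params_billion, bits_per_param, bandwidth_gb_s):
    """Upper bound on decode speed when weight streaming is the only cost.

    Assumes every parameter is read once per generated token (dense model).
    """
    # Billions of parameters * bytes per parameter = GB read per token.
    gb_per_token = params_billion * bits_per_param / 8
    return bandwidth_gb_s / gb_per_token


# The example from the comment above: 32B parameters, 8 bits each, 64 GB/s RAM.
print(max_tokens_per_second(32, 8, 64))  # 2.0
```

Under the same assumptions, dropping to 4 bits per parameter halves the traffic per token and doubles the bound.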
I didn't get the impression that the math is that simple. The first obvious reason I can think of is the attention mechanism being used. Both GQA and MQA share key/value heads across query heads, so they keep a smaller KV cache and therefore need less memory bandwidth per token than MHA.
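One way to get a feel for how much the attention variant matters is to estimate the KV cache that has to be read for each new token. The sketch below is only a rough model: the layer count, head counts, head size, context length and fp16 cache are hypothetical values picked for illustration, not the configuration of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Bytes of cached keys and values read to generate one token.

    The factor 2 covers keys plus values; each layer keeps its own cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model: 64 layers, 64 query heads of size 128,
# 8192 tokens of context, fp16 cache (2 bytes per element).
LAYERS, QUERY_HEADS, HEAD_DIM, CONTEXT, FP16 = 64, 64, 128, 8192, 2

variants = {
    "MHA": QUERY_HEADS,  # one KV head per query head
    "GQA": 8,            # assumed: 8 KV heads shared by groups of query heads
    "MQA": 1,            # a single KV head shared by all query heads
}

for name, kv_heads in variants.items():
    size = kv_cache_bytes(LAYERS, kv_heads, HEAD_DIM, CONTEXT, FP16)
    print(f"{name}: {size / 1e9:.2f} GB of KV cache read per token")
```

With these made-up numbers the cache traffic scales directly with the number of KV heads, so it comes on top of the weight traffic and grows with context length, which is one reason the simple weights-only estimate does not tell the whole story.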