| HN Mirror



	by menaerus 545 days ago
	How much memory bandwidth do we actually need per generated token? Let's take one open-source model as a starting point, since not all models are created the same.

2 comments

snovv_crash 545 days ago

For non-MoE (dense) models, the entire model has to flow through the CPU for every generated token. So if it is a 32B-parameter model quantised to 8 bits per parameter, that is 32 GB of memory traffic per token. If your RAM does 64 GB/s, that caps you at about 2 tok/s.
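
A rough sketch of that arithmetic, assuming decoding is purely memory-bandwidth bound, every weight is read exactly once per token, and KV cache reads and compute time are ignored. The function name `max_tokens_per_second` is made up for illustration:

```python
def max_tokens_per_second(params_billion: float,
                          bits_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed for a dense model.

    Assumes all weights are streamed from RAM once per generated token.
    """
    gb_read_per_token = params_billion * bits_per_param / 8  # GB per token
    return bandwidth_gb_s / gb_read_per_token


# The numbers from the comment: 32B parameters, 8 bits each, 64 GB/s RAM.
print(max_tokens_per_second(32, 8, 64))  # 2.0
```

Treat the result as a ceiling rather than a prediction; real throughput usually lands below it.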


menaerus 544 days ago

I didn't get the impression that the math around it is that simplistic. The first obvious reason I can think of now is the attention mechanism being used. Both GQA and MQA demand less compute and therefore less bandwidth than MHA.
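
One way to put numbers on the attention point: MHA, GQA and MQA differ in how many key/value heads are cached, so they differ in how much KV cache has to be read per generated token on top of the weights. The sketch below uses a hypothetical model shape (64 layers, 64 query heads, head dimension 128, fp16 cache) that is not taken from any particular model, and the helper name is invented for this example:

```python
def kv_cache_bytes_per_context_token(n_layers: int,
                                     n_kv_heads: int,
                                     head_dim: int,
                                     bytes_per_value: int = 2) -> int:
    """Bytes of KV cache stored (and re-read while decoding) per token of context."""
    # Factor 2: one key vector and one value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical shape: 64 layers, head_dim 128, fp16 (2 bytes per value).
# Only the number of KV heads changes between the three variants.
for name, kv_heads in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    per_token = kv_cache_bytes_per_context_token(64, kv_heads, 128)
    print(f"{name}: {per_token} bytes per token of context")
```

With short contexts this traffic is small next to the weights, so the simple estimate still dominates; it starts to matter as the context grows.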


ryao 545 days ago

How big are the active weights? That is how much data you need to read per token, so multiply it by your tokens per second to get the bandwidth you need.
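
Turned around, the same idea gives the bandwidth needed for a target speed. This is a minimal sketch, assuming only the active weights are read for each token; the 13B active-parameter figure is a hypothetical MoE example, and `required_bandwidth_gb_s` is a name invented here:

```python
def required_bandwidth_gb_s(active_params_billion: float,
                            bits_per_param: float,
                            target_tok_s: float) -> float:
    """Memory bandwidth needed to hit a target decode speed.

    Assumes only the active weights are read for each generated token.
    """
    gb_read_per_token = active_params_billion * bits_per_param / 8
    return gb_read_per_token * target_tok_s


# Hypothetical MoE: 13B active parameters at 8 bits, aiming for 10 tok/s.
print(required_bandwidth_gb_s(13, 8, 10))  # 130.0
```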
