| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by snovv_crash 545 days ago
	For non-MoE models, it needs to flow the entire model through the CPU. So if it is a 32B parameter model quantised to 8b/parameter, that is 32GB of RAM bandwidth per token. If your RAM does 64GB/s that is 2 tok/s.

1 comments

menaerus 545 days ago

I didn't get the impression that the math around it is that simplistic. The first obvious reason I can think of now is the attention mechanism being used. Both GQA and MQA demand less compute and therefore less bandwidth than MHA.

link