| HN Mirror



	by snovv_crash 508 days ago
	No, because the bottleneck is RAM bandwidth. The weights are already quantized and otherwise essentially random, so they can't be compressed in any meaningful way.


menaerus 508 days ago

How much bandwidth do we actually need per generated token? Let's take one open-source model as a starting point, since not all models are created the same.


snovv_crash 507 days ago

For non-MoE models, the entire model has to flow through the CPU for every token. So if it is a 32B-parameter model quantised to 8 bits per parameter, that is 32 GB of memory traffic per token. If your RAM does 64 GB/s, that is 2 tok/s.
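
The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope upper bound: it assumes every weight is read exactly once per token and ignores KV-cache traffic, activations, and compute time. The function names are made up for illustration.

```python
def bytes_per_token(params: float, bits_per_param: float) -> float:
    """Bytes of weights that must be read to generate one token (dense model)."""
    return params * bits_per_param / 8


def max_tokens_per_second(params: float, bits_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / bytes_per_token(params, bits_per_param)


# The example from the comment: 32B parameters, 8 bits each, 64 GB/s of RAM bandwidth.
print(bytes_per_token(32e9, 8) / 1e9)          # 32.0 (GB per token)
print(max_tokens_per_second(32e9, 8, 64e9))    # 2.0 (tok/s)
```

Under the same assumptions, halving the bits per parameter halves the bytes read per token and doubles the bound.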


menaerus 507 days ago

I didn't get the impression that the math is that simple. The first obvious reason I can think of is the attention mechanism in use: both GQA and MQA demand less compute, and therefore less bandwidth, than MHA.


ryao 507 days ago

How big are the active weights? That's how much memory traffic you need per token.
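
Turned around, the same rule of thumb gives the bandwidth needed for a target speed. A rough sketch, assuming the active weights are read once per token and nothing else matters; the function name and the 4B-active MoE figure are hypothetical, chosen only to illustrate the calculation.

```python
def required_bandwidth(active_params: float, bits_per_param: float,
                       target_tokens_per_s: float) -> float:
    """Bytes per second needed to stream the active weights at the target rate."""
    active_bytes = active_params * bits_per_param / 8
    return active_bytes * target_tokens_per_s


# Dense 32B model at 8 bits, as in the thread, targeting 2 tok/s.
print(required_bandwidth(32e9, 8, 2) / 1e9)    # 64.0 (GB/s)

# Hypothetical MoE with 4B active parameters at 8 bits, targeting 10 tok/s.
print(required_bandwidth(4e9, 8, 10) / 1e9)    # 40.0 (GB/s)
```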
