by snovv_crash 508 days ago
No, because the bottleneck is RAM bandwidth. The weights are already quantized and otherwise look essentially random, so they can't be compressed in any meaningful way.

How much bandwidth do we actually need per generated token? Let's take one open-source model as a starting point, since not all models are created the same.
For non-MoE models, the entire model has to flow through the CPU for every token. So a 32B-parameter model quantised to 8 bits per parameter means 32 GB of memory traffic per token. If your RAM does 64 GB/s, that is 2 tok/s.
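A back-of-the-envelope sketch of that arithmetic, assuming decoding is purely memory-bandwidth-bound and every weight is read exactly once per token. The function name is made up for illustration, and real throughput will be lower than this upper bound:

```python
def tokens_per_second(params_billion, bits_per_param, bandwidth_gb_s):
    """Upper bound on decode speed when each weight is read once per token."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_per_token = params_billion * bits_per_param / 8
    return bandwidth_gb_s / gb_per_token


# The example above: 32B parameters, 8 bits each, 64 GB/s of RAM bandwidth.
print(tokens_per_second(32, 8, 64))  # 2.0
# Same model at 4 bits per parameter halves the traffic per token.
print(tokens_per_second(32, 4, 64))  # 4.0
```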
I didn't get the impression that the math around it is that simplistic. The first obvious reason I can think of is the attention mechanism being used. Both GQA and MQA keep a smaller KV cache than MHA and therefore need less bandwidth.
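One place the attention variant shows up in memory traffic is the KV cache, which is read on top of the weights for every generated token. A rough sketch of its size, where the model shape (32 layers, 32 query heads, head dimension 128, 4096-token context, fp16 cache) is a hypothetical configuration and not any particular model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Size of the KV cache for one sequence at a given context length."""
    # The leading factor of 2 covers keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Only the number of KV heads differs between the three variants.
variants = {"MHA": 32, "GQA": 8, "MQA": 1}
for name, n_kv_heads in variants.items():
    size = kv_cache_bytes(32, n_kv_heads, 128, 4096)
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

Under these assumptions the cache is small next to 32 GB of weights, so the weight-read estimate still dominates at short contexts; the gap narrows as the context grows.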
How big are the active weights? That's how much data you have to read per token; multiply by tokens per second to get the bandwidth you need.