by phamilton 4 days ago
Generation is basically just memory bandwidth math.

Each generated token has to read all the active weights. I think that's around 40B active parameters. At a 4-bit quant that's 20GB per token. At 100GB/s (replace with whatever your bandwidth is) you get about 5 tokens per second.
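
A rough sketch of that arithmetic in Python, in case you want to plug in your own numbers. The function name and parameters are made up for illustration, and it assumes decode is purely bandwidth-bound (no compute limit, no KV cache reads), so treat the result as an upper bound.

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed, assuming every token streams all active weights once."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


# Numbers from the comment: ~40B active parameters, 4-bit quant, 100 GB/s.
print(decode_tokens_per_second(40, 4, 100))  # 5.0
```

Swap in your own bandwidth for the last argument; the estimate scales linearly with it.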


And with MTP (multi-token prediction) or other speculative decoding techniques you can ~double that.
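MTP on a MoE is hit or miss. If you're bottlenecked on memory bandwidth, checking several predicted tokens in one pass can increase the number of active experts (like any batch processing would), since each token may route to different experts. That means more weights read per step, which can eat away at the gains.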
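
To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch. It leans on assumptions that are not from this thread: routing is treated as uniform and independent per token (real routers are not), all active weights are treated as expert weights (shared attention/dense weights are ignored, which makes it pessimistic), and the 64-expert / top-8 configuration is hypothetical. The function names are invented for illustration.

```python
def expected_distinct_experts(num_experts, top_k, tokens_in_step):
    """Expected distinct experts touched in one step, ASSUMING uniform independent routing."""
    # Probability that a given expert is picked by none of the tokens in the step.
    p_untouched = (1 - top_k / num_experts) ** tokens_in_step
    return num_experts * (1 - p_untouched)


def speculation_speedup(num_experts, top_k, draft_tokens, accepted_per_step):
    """Tokens emitted per step divided by the relative increase in weights read."""
    tokens_in_step = draft_tokens + 1  # drafts plus the token being decoded anyway
    # Cost relative to plain decoding, which reads top_k experts per step.
    read_cost = expected_distinct_experts(num_experts, top_k, tokens_in_step) / top_k
    return accepted_per_step / read_cost


# Hypothetical MoE: 64 experts, 8 active per token, 1 draft token, always accepted.
print(expected_distinct_experts(64, 8, 2))
print(speculation_speedup(64, 8, 1, 2.0))
```

With these made-up numbers the step touches 15 experts instead of 8, so emitting 2 tokens per step only nets about 1.07x rather than 2x. With a dense model (or heavy expert overlap between adjacent tokens) the read cost stays near 1 and the ~2x holds up much better.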