HN Mirror



	by phamilton 4 days ago
	Generation is basically just memory bandwidth math. Each token has to read all the active weights. I think that's around 40B active parameters. At a 4-bit quant that's 20GB read per token. At 100GB/s (replace with whatever your bandwidth is) you get 5 tokens per second.
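A minimal sketch of that arithmetic, assuming decode is purely memory-bandwidth bound and ignoring KV-cache reads and compute time. The function and parameter names are made up for illustration:

```python
def decode_tokens_per_second(active_params_billion, bits_per_weight, bandwidth_gb_per_s):
    """Rough upper bound on decode speed for a bandwidth-bound model.

    Assumption: every generated token reads all active weights exactly once,
    and nothing else (KV cache, activations) competes for bandwidth.
    """
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_per_s / gb_read_per_token


# The numbers from the comment above: 40B active, 4-bit quant, 100GB/s.
print(decode_tokens_per_second(40, 4, 100))
```

Swap in your own bandwidth and quantization to get a ceiling for your hardware; real throughput will land somewhat below it.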

1 comments

SlavikCA 4 days ago

And with MTP (or other speculation techniques) you can ~double that.


phamilton 3 days ago

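MTP on a MoE is hit or miss. If you're bottlenecked on memory bandwidth, MTP can increase the number of experts that are active per step (like any batch processing would), since each speculated token may route to different experts. Reading those extra expert weights can eat away at the gains.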
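A rough sketch of why the gains can shrink, under loud assumptions: routing is treated as uniform and independent per token (real routers are correlated, so overlap is usually better than this), only expert weights are counted (attention and shared weights are ignored), and the 64-expert, 8-per-token configuration is hypothetical rather than any specific model:

```python
def expected_distinct_experts(num_experts, experts_per_token, tokens_in_step):
    """Expected number of distinct experts touched when several tokens
    are processed in one step.

    Assumption: each token picks its experts uniformly and independently.
    """
    # Probability that a given expert is skipped by one token.
    p_miss_one_token = 1 - experts_per_token / num_experts
    # An expert is read if at least one token in the step routes to it.
    return num_experts * (1 - p_miss_one_token ** tokens_in_step)


def speculative_speedup(num_experts, experts_per_token, tokens_in_step, accepted_per_step):
    """Tokens gained per step divided by the growth in expert weight traffic."""
    traffic_growth = (
        expected_distinct_experts(num_experts, experts_per_token, tokens_in_step)
        / experts_per_token
    )
    return accepted_per_step / traffic_growth


# Hypothetical MoE: 64 experts, 8 active per token, 2 tokens verified per step,
# both accepted (the best case for speculation).
print(expected_distinct_experts(64, 8, 2))
print(speculative_speedup(64, 8, 2, 2))
```

With these made-up numbers, two tokens per step touch nearly twice as many experts as one, so even perfect acceptance yields far less than a 2x speedup. A dense model has no such penalty because the same weights serve every token in the step.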
