HN Mirror



	by guntars 502 days ago
	Since it's a MoE model with 37B active params, I imagined you wouldn't need enough RAM to keep the whole model in memory, just the active bits.

1 comment

rahimnathwani 502 days ago

The active bits may change with each token. You need the whole model in memory, even though, for any single token, only a subset of that memory is used in computation. The memory efficiency comes when you run multiple sessions in parallel: tokens from different sessions route to different experts, so more of the resident weights do useful work in each step.
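To make the "active bits change with each token" point concrete, here is a toy simulation, not any real model's router. It assumes a router that picks the top-k of N experts per token; real routers score experts from each token's hidden state, so the random scores here are a stand-in. The counts (64 experts, top-4) are arbitrary illustrative values, not the configuration of the model under discussion, and the function names are hypothetical.

```python
import random


def route_tokens(num_tokens, num_experts, top_k, seed=0):
    """Toy top-k router (assumption: random gate scores stand in for learned ones).

    Returns one set of expert indices per token.
    """
    rng = random.Random(seed)
    choices = []
    for _ in range(num_tokens):
        scores = [rng.random() for _ in range(num_experts)]
        # Keep the indices of the top_k highest-scoring experts for this token.
        top = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)[:top_k]
        choices.append(set(top))
    return choices


def experts_touched(choices):
    """Union of all experts used by the given tokens."""
    touched = set()
    for c in choices:
        touched |= c
    return touched


def reuse_per_expert(choices):
    """Average number of tokens served by each expert that was used at all."""
    touched = experts_touched(choices)
    if not touched:
        return 0.0
    assignments = sum(len(c) for c in choices)
    return assignments / len(touched)


if __name__ == "__main__":
    NUM_EXPERTS, TOP_K = 64, 4  # arbitrary toy values

    # One session decoding tokens one after another: the union keeps growing,
    # so every expert has to stay resident even though each token uses only TOP_K.
    seq = route_tokens(num_tokens=32, num_experts=NUM_EXPERTS, top_k=TOP_K, seed=0)
    for t in (1, 4, 16, 32):
        n = len(experts_touched(seq[:t]))
        print(f"after {t:2d} tokens: {n}/{NUM_EXPERTS} experts touched")

    # 16 parallel sessions, one token each, processed in a single step:
    # an expert used in this step is often shared by several tokens.
    batch = route_tokens(num_tokens=16, num_experts=NUM_EXPERTS, top_k=TOP_K, seed=1)
    print(f"batch of 16: avg tokens per used expert = {reuse_per_expert(batch):.2f}")
```

The first line always reports exactly `TOP_K` experts, while later lines show the union growing as decoding continues, which is why the whole model must stay in memory. In the batched case the average is typically above 1, meaning each expert's weights, once read, serve several tokens in the same step; that is the sense in which parallel sessions use the memory more efficiently.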
