Hacker News
by bjornsing 814 days ago
It means that it’s a mixture-of-experts model with 132B parameters in total, of which only a 36B subset is selected in each forward pass, depending on the context. The parameters not selected for a particular token belong to “experts” that were deemed a poor fit for predicting the next token in the current context, but they could well be selected for, e.g., the very next token.
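To make the "selected per token" part concrete, here is a toy sketch of top-k expert routing in Python. The 16 experts with 4 picked per token are my assumption (the thread only gives the 132B / 36B totals), the random scores stand in for what a learned router would produce, and `softmax` and `route` are names made up for this example. In real MoE transformers this choice is typically made separately in each layer, which the sketch ignores.

```python
import math
import random

NUM_EXPERTS = 16  # assumption, not stated in the thread
TOP_K = 4         # assumption, not stated in the thread


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so the chosen experts' weights sum to 1.
    """
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


if __name__ == "__main__":
    rng = random.Random(0)
    for token in ["The", "cat", "sat"]:
        # Stand-in for the router's output on this token.
        scores = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
        chosen = route(scores)
        print(token, "-> experts", sorted(i for i, _ in chosen))
```

Each token gets its own score vector, so the set of chosen experts can change from one token to the next, while the other 12 experts sit idle for that token.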

Do the 132B params need to be loaded in GPU memory, or only the 36B?
For efficiency, 132B.

That way, at inference time you get the speed of a 36B model, because you are only "using" 36B params per token. But the next token might (and frequently does) need a different set of experts than the one before it. If that new set is already loaded (i.e. you preloaded the full 132B params into GPU VRAM), there's no switching overhead, and you keep running at 36B speed no matter which experts get picked.
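A back-of-the-envelope sketch of that trade-off, in Python: memory scales with the total parameter count, compute per token with the active count. The 2 bytes per parameter (16-bit weights) and the "about 2 FLOPs per active parameter per token" rule of thumb are my assumptions, not figures from the thread, and the function names are made up for illustration.

```python
TOTAL_PARAMS = 132e9   # from the thread
ACTIVE_PARAMS = 36e9   # from the thread
BYTES_PER_PARAM = 2    # assumption: 16-bit weights


def weights_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def flops_per_token(n_active_params):
    """Rough forward-pass cost: ~2 FLOPs per active parameter (assumed rule of thumb)."""
    return 2 * n_active_params


if __name__ == "__main__":
    print(f"Weights resident in VRAM: {weights_gb(TOTAL_PARAMS):.0f} GB")
    print(f"Weights touched per token: {weights_gb(ACTIVE_PARAMS):.0f} GB")
    ratio = flops_per_token(ACTIVE_PARAMS) / flops_per_token(TOTAL_PARAMS)
    print(f"Compute vs. a dense 132B model: {ratio:.0%}")
```

Under these assumptions you pay for all 132B in memory but only about 36/132 of the compute of an equally large dense model.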

You could theoretically load only 36B at a time, but you would be severely bottlenecked by having to reload those 36B params, potentially for every new token! Even on top-of-the-line consumer GPUs, that would slow you down to roughly seconds per token instead of tokens per second :)
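A rough Python estimate of where "seconds per token" comes from, for the worst case where every token needs a completely fresh 36B loaded from host memory. The 16-bit weights and the 32 GB/s transfer rate (roughly the theoretical peak of a PCIe 4.0 x16 link) are my assumptions, and real sustained throughput is usually lower; `reload_seconds` is a made-up helper.

```python
ACTIVE_PARAMS = 36e9        # from the thread
BYTES_PER_PARAM = 2         # assumption: 16-bit weights
HOST_TO_GPU_GB_PER_S = 32   # assumption: ~theoretical peak of PCIe 4.0 x16


def reload_seconds(n_params, bytes_per_param, gb_per_s):
    """Time to copy n_params weights to the GPU at the given bandwidth."""
    gigabytes = n_params * bytes_per_param / 1e9
    return gigabytes / gb_per_s


if __name__ == "__main__":
    t = reload_seconds(ACTIVE_PARAMS, BYTES_PER_PARAM, HOST_TO_GPU_GB_PER_S)
    print(f"Worst-case reload per token: {t:.2f} s")
```

With these numbers the transfer alone takes a couple of seconds per token, before any actual computation happens.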