by sroussey 809 days ago
A mixture of experts is kind of like a team with a manager. Depending on the input, the manager picks one or two team members to do the work, not the entire team.

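So in this analogy, each team member and the manager have a certain number of params. The whole team is 132B. The manager and the team members that run for a specific input add up to 36B. Only those do the work for that input, though the whole team generally still has to be loaded into memory, because a different subset can be picked for each token.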
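
To make the arithmetic concrete, here is a rough Python sketch of the analogy. The split into 8 equally sized team members of 16B each, 2 picked per input, and 4B for the manager plus everything shared is purely an assumption, chosen so the totals land on 132B and 36B; the real breakdown isn't given here. The names `pick_experts`, `total_params_b`, and `active_params_b` are made up for illustration, and the scores are random stand-ins for what a real router would compute.

```python
import random

# Hypothetical sizes in billions of parameters, chosen only so the totals
# match the 132B / 36B figures; the actual split is an assumption.
MANAGER_AND_SHARED_B = 4   # assumption: the manager (router) plus shared layers
EXPERT_B = 16              # assumption: every team member is the same size
NUM_EXPERTS = 8            # assumption: size of the whole team
TOP_K = 2                  # assumption: team members picked per input


def pick_experts(scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts (top-k routing)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def total_params_b():
    """Parameters in the whole team, in billions."""
    return MANAGER_AND_SHARED_B + NUM_EXPERTS * EXPERT_B


def active_params_b(k=TOP_K):
    """Parameters that actually run for one input, in billions."""
    return MANAGER_AND_SHARED_B + k * EXPERT_B


if __name__ == "__main__":
    rng = random.Random(0)
    for token in ["the", "cat", "sat"]:
        # Random scores stand in for the manager's learned scoring.
        scores = [rng.random() for _ in range(NUM_EXPERTS)]
        print(token, "-> experts", pick_experts(scores))
    print("whole team:", total_params_b(), "B")
    print("active per input:", active_params_b(), "B")
```

Since the picked team members can change from one token to the next, the sketch also shows why the active count describes compute per input more than it describes what has to be kept around.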