The goal of "mixture of experts" is to add more parameters to the model to make it more powerful, without a matching increase in the compute spent on each token. The way this is done is by having sections of the model ("experts") that sit in parallel with each other, with each token going through only one of them (or a small number of them, depending on the design). Think of it like a multi-lane highway with a toll booth on each lane: each car drives on only one lane rather than using them all, so it pays only one toll.
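
To put rough numbers on the parameters-versus-compute trade, here is a back-of-the-envelope sketch. It assumes each expert is a plain two-matrix feed-forward block and that the router is a single linear layer; the function names and the sizes (1024, 4096, 8 experts) are made up for illustration and not taken from any particular model.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in one feed-forward block: up- and down-projection, biases ignored."""
    return 2 * d_model * d_ff


def moe_layer_stats(d_model: int, d_ff: int, num_experts: int,
                    experts_per_token: int = 1) -> dict:
    """Total vs. per-token-active weights for a hypothetical MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    # Assumed router: one linear layer producing a score per expert.
    router = d_model * num_experts
    return {
        "total_params": num_experts * per_expert + router,
        "active_params_per_token": experts_per_token * per_expert + router,
    }


# Illustrative sizes only.
dense = ffn_params(1024, 4096)
moe = moe_layer_stats(1024, 4096, num_experts=8)

print("dense FFN params:      ", dense)
print("MoE total params:      ", moe["total_params"])
print("MoE active per token:  ", moe["active_params_per_token"])
```

With these assumed sizes the layer holds roughly eight times the weights of the dense block, while each token still touches about one block's worth plus the small router.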

The name "experts" is a bit misleading, since each expert ("highway lane") is not really specialized in any obviously meaningful way. There is a routing/gating component in front of the experts that chooses, token by token (not sentence by sentence!), which "expert" to send each token to. The goal is to roughly load balance between the experts so that they all see about the same number of tokens, and the parameters in each expert are therefore all about equally utilized.
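
A minimal sketch of what per-token routing looks like, assuming a top-1 scheme where a linear router scores every expert and the highest score wins. Everything here (`route_top1`, the random weights, the sizes) is hypothetical and only meant to show the mechanics. A randomly initialized router like this one is not guaranteed to be balanced; trained systems typically add an extra load-balancing term to the loss to push the counts toward even.

```python
import math
import random
from collections import Counter


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token_vec, router_weights):
    """Return (expert_index, gate_probability) for a single token.

    router_weights has one row per expert; each row is dotted with the token.
    """
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


random.seed(0)
D_MODEL, NUM_EXPERTS, NUM_TOKENS = 16, 4, 1000  # illustrative sizes

# Stand-ins for a learned router and for token embeddings.
router_weights = [[random.gauss(0, 1) for _ in range(D_MODEL)]
                  for _ in range(NUM_EXPERTS)]
tokens = [[random.gauss(0, 1) for _ in range(D_MODEL)]
          for _ in range(NUM_TOKENS)]

# Each token is routed on its own, independent of its neighbours.
routed = [route_top1(t, router_weights) for t in tokens]
load = Counter(expert for expert, _ in routed)

for expert in range(NUM_EXPERTS):
    print(f"expert {expert}: {load[expert]} tokens")
```

Note that consecutive tokens of the same sentence can land on different experts, since the decision is made from each token's vector alone.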

The fact that the tokens in a sentence will be somewhat arbitrarily sent through different "experts" makes it an odd kind of expertise - not directly related to the sentence as a whole! There has been experimentation with a whole bunch of routing (expert selection) schemes.