by digdugdirk
808 days ago
See, this is where my understanding of LLMs breaks down. I can understand one token going through the model, but I can't understand a model that has different "experts" internally. Do you have any resources or links to help explain that concept?
The name "experts" is a bit misleading, since each expert ("highway lane") is not really specialized in any obviously meaningful way. A routing/gating component in front of the experts chooses, on a token-by-token basis (not sentence by sentence!), which "expert" each token goes to. The goal is to roughly load-balance across the experts so that they all see about the same number of tokens and the parameters in each expert are therefore equally utilized.
The fact that the tokens in a sentence will be somewhat arbitrarily sent through different "experts" makes it an odd kind of expertise - not directly related to the sentence as a whole! There has been experimentation with a whole bunch of routing (expert selection) schemes.
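
To make the token-by-token routing concrete, here is a rough sketch in plain Python. It is not how any particular model does it: the gate is assumed to be a single linear layer followed by a softmax, and the names (`route_tokens`, `expert_load`, `gate_weights`, `top_k`) and the random "token vectors" are made up for illustration. It only shows that each token independently picks its expert(s), and it counts the resulting load; it does not implement any balancing.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(token_vectors, gate_weights, top_k=1):
    """For each token, return a list of (expert_index, gate_probability).

    gate_weights has one row per expert (hypothetical linear gate).
    """
    assignments = []
    for vec in token_vectors:
        # One logit per expert: dot product of the token with that expert's row.
        logits = [sum(w * x for w, x in zip(row, vec)) for row in gate_weights]
        probs = softmax(logits)
        # Pick the top_k experts with the highest gate probability.
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        assignments.append([(i, probs[i]) for i in ranked[:top_k]])
    return assignments


def expert_load(assignments, num_experts):
    """Count how many tokens were routed to each expert."""
    counts = [0] * num_experts
    for choices in assignments:
        for expert_index, _ in choices:
            counts[expert_index] += 1
    return counts


# Toy setup: random gate and random "token" vectors (illustrative only).
random.seed(0)
NUM_EXPERTS, DIM, NUM_TOKENS = 4, 8, 16
gate = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

assignments = route_tokens(tokens, gate, top_k=1)
load = expert_load(assignments, NUM_EXPERTS)
print("expert per token:", [choices[0][0] for choices in assignments])
print("tokens per expert:", load)
```

Neighbouring tokens in the same "sentence" can land on different experts, which is the odd part. With an untrained random gate like this one the load will usually be uneven; nothing in the sketch pushes it toward balance.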