by digdugdirk 808 days ago
See, this is where my understanding of LLMs breaks down. I can understand one token going through the model, but I can't understand a model that has different "experts" internally.

Do you have any resources or links to help explain that concept?

2 comments

The goal of "mixture of experts" is to add more parameters to the model to make it more powerful, without a matching increase in compute per token. This is done by having sections of the model ("experts") sit in parallel with each other, with each token going through only one of them (or a small subset, depending on the design). Think of it like a multi-lane highway with a toll booth on each lane - each car drives on only one lane rather than using them all, so it pays only one toll.
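To put rough numbers on the "more parameters, same compute" idea, here is a back-of-the-envelope sketch. The layer sizes and the function name `moe_ffn_counts` are made up for illustration, and it assumes each expert is a plain two-matrix feed-forward block with no biases, plus a small linear router; real models differ in the details.

```python
def moe_ffn_counts(d_model, d_hidden, num_experts, experts_per_token=1):
    """Return (total_params, active_params_per_token) for one MoE feed-forward layer.

    Assumption: each expert is d_model -> d_hidden -> d_model (two weight
    matrices, no biases), and the router is a d_model x num_experts matrix.
    """
    per_expert = 2 * d_model * d_hidden
    router = d_model * num_experts
    total = num_experts * per_expert + router
    # Only the chosen expert(s) and the router are touched for a given token.
    active = experts_per_token * per_expert + router
    return total, active


# Hypothetical sizes, chosen only to make the comparison concrete.
dense_total, dense_active = moe_ffn_counts(1024, 4096, num_experts=1)
moe_total, moe_active = moe_ffn_counts(1024, 4096, num_experts=8)

print(f"1 expert:  total={dense_total:,}  active per token={dense_active:,}")
print(f"8 experts: total={moe_total:,}  active per token={moe_active:,}")
```

With 8 experts the layer holds roughly 8x the parameters, but the parameters touched per token barely change: the only extra cost in this sketch is the slightly larger router.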

The name "experts" is a bit misleading, since each expert ("highway lane") is not really specialized in any obviously meaningful way. A routing/gating component in front of the experts chooses, on a token-by-token basis (not sentence by sentence!), which "expert" to send each token to. The goal is to roughly load balance between the experts so that they all see about the same number of tokens, and the parameters in each expert are therefore about equally utilized.
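A toy sketch of per-token routing, in case it helps. Everything here is illustrative: the router is just a random linear scorer (`router_weights`), the "tokens" are random vectors, and each token goes to the single highest-scoring expert. In a real model the router is learned, and balance is typically encouraged during training rather than happening by itself, so do not expect the loads printed here to come out even.

```python
import random


def route_top1(tokens, router_weights):
    """Pick one expert index per token: argmax of a linear score per expert."""
    choices = []
    for tok in tokens:
        # One score per expert: dot product of that expert's router row with the token.
        scores = [sum(w * x for w, x in zip(expert_w, tok)) for expert_w in router_weights]
        choices.append(max(range(len(scores)), key=scores.__getitem__))
    return choices


rng = random.Random(0)  # fixed seed so repeated runs give the same routing
d_model, num_experts, num_tokens = 8, 4, 16  # hypothetical toy sizes

router_weights = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_experts)]
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(num_tokens)]

choices = route_top1(tokens, router_weights)
load = [choices.count(e) for e in range(num_experts)]

print("expert chosen for each token:", choices)
print("tokens seen by each expert:  ", load)
```

Note that the decision is made for each token independently, so neighbouring tokens in the same sentence can easily land on different experts.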

The fact that the tokens in a sentence will be somewhat arbitrarily sent through different "experts" makes it an odd kind of expertise - not directly related to the sentence as a whole! There has been experimentation with a whole bunch of routing (expert selection) schemes.

It is still just one token going through the model.

I actually think "mixture of experts" is a bit of a misnomer: the "experts" do not necessarily have very distinct expertise. Think of it more like how neurons activate in the brain - your entire brain doesn't light up for every query, and the same thing happens in these networks (the model doesn't fully light up for every token).

I don't really know of a resource besides the seminal Noam Shazeer paper, sorry - I'm sure others can point to higher-level explanations.