by whimsicalism
809 days ago
It is still just one token going through the model. I actually think "mixture of experts" is a bit of a misnomer: the "experts" do not necessarily have very distinct expertise. Think of it more like how neurons activate in the brain. Your entire brain doesn't light up for every query, and the same thing happens in these networks: only part of the model is active for any given token. I don't really know a resource besides the seminal Noam Shazeer paper, sorry. I'm sure others can point to something higher-level.
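If it helps, here is a rough sketch of the "only part of it lights up" idea, in plain Python. It assumes top-k routing in the spirit of the Shazeer paper. The names (`moe_layer`, `router_w`, `experts`) and the sizes are made up for illustration, and each "expert" here is just a single weight matrix, whereas real experts are small feed-forward networks. Treat it as a toy, not how any particular model implements it.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def moe_layer(token, router_w, experts, k=2):
    """Send ONE token vector through only the k highest-scoring experts."""
    scores = matvec(router_w, token)  # one router score per expert
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    gates = softmax([scores[i] for i in top])  # renormalise over the chosen experts
    out = [0.0] * len(experts[0])
    for g, i in zip(gates, top):
        y = matvec(experts[i], token)  # only these k experts do any work
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top


# Hypothetical sizes: 4-dimensional token, 8 experts, 2 active per token.
random.seed(0)
d, n_experts = 4, 8
router_w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
experts = [
    [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    for _ in range(n_experts)
]
token = [random.gauss(0, 1) for _ in range(d)]

out, used = moe_layer(token, router_w, experts, k=2)
print(f"experts used: {sorted(used)} of {n_experts}")
```

Note that nothing forces expert 3 to be "the math expert" or anything like that. The router just learns whichever split works, which is why the name is a bit misleading.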