by whimsicalism 809 days ago
It is still just one token going through the model.

I actually think "mixture of experts" is a bit of a misnomer: the 'experts' don't necessarily have distinct areas of expertise. Think of it more like how neurons activate in the brain - your entire brain doesn't light up for every stimulus, and the same thing happens in these networks: only part of the model is active for any given token.
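If it helps, here's a rough toy sketch of what I mean by "only part of it lights up". It assumes top-k routing with a softmax over the selected scores, and every name in it (make_expert, moe_forward, the sizes, k=2) is made up for illustration. It is not the paper's exact formulation and not any library's API.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a small list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    # Hypothetical "expert": just a random dim x dim linear map.
    w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def moe_forward(token, router, experts, k=2):
    # The router produces one score per expert for this single token.
    logits = [sum(r * t for r, t in zip(row, token)) for row in router]
    # Keep only the k highest-scoring experts; the rest are never evaluated.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Normalize the selected scores into mixing weights.
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(token)
    for g, i in zip(gates, top):
        y = experts[i](token)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top


rng = random.Random(0)
dim, n_experts = 4, 8
experts = [make_expert(dim, rng) for _ in range(n_experts)]
router = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
token = [rng.gauss(0, 1) for _ in range(dim)]

out, used = moe_forward(token, router, experts, k=2)
print(f"experts used: {sorted(used)} of {n_experts}")
```

Note that nothing here makes expert 3 "the math expert" or anything like that. The router just picks whichever experts score highest for this token, and the other six do no work.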

I don't really know of a resource besides the seminal Noam Shazeer paper, sorry - I'm sure others have higher-level ones.