by zamadatix
435 days ago
The only thing about this which may be unintuitive from the name is that an "Expert" is not something like a sub-LLM that's good at math and gets called when you ask a math question. Models like this have layers of networks they run tokens through, and each layer is composed of 256 sub-networks, any of which can be selected (or multiple selected and merged in some way) for each layer independently. So the net result is the same: sets of parameters in the model are specialized and selected for certain inputs. It's just done a bit deeper in the model than one may assume.
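
To make the per-layer selection concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration, not how any particular model is implemented: the names (`make_layer`, `moe_layer`, `run_token`), the random weights, the tiny sizes (8 experts instead of 256, 4-dimensional tokens), and the choice of "top-k experts, softmax-weighted sum" as the merge rule. Real models also have attention, nonlinearities, and residual connections that are left out here.

```python
import math
import random

def make_layer(num_experts, dim, rng):
    """One toy MoE layer: a router matrix plus one weight matrix per expert."""
    def rand_matrix(rows):
        return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(rows)]
    return {"router": rand_matrix(num_experts),
            "experts": [rand_matrix(dim) for _ in range(num_experts)]}

def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def moe_layer(layer, token, top_k):
    """Route one token vector through the top_k highest-scoring experts."""
    scores = matvec(layer["router"], token)  # one score per expert
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over only the chosen experts' scores gives the mixing weights.
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    weights = [e / sum(exps) for e in exps]
    # "Merged in some way": assumed here to be a weighted sum of expert outputs.
    output = [0.0] * len(token)
    for weight, idx in zip(weights, chosen):
        expert_out = matvec(layer["experts"][idx], token)
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output, chosen

def run_token(layers, token, top_k=2):
    """Pass one token through every layer; each layer picks its own experts."""
    picks = []
    for layer in layers:
        token, chosen = moe_layer(layer, token, top_k)
        picks.append(chosen)
    return token, picks

rng = random.Random(0)
layers = [make_layer(num_experts=8, dim=4, rng=rng) for _ in range(3)]
token = [rng.gauss(0, 1) for _ in range(4)]
final, picks = run_token(layers, token)
print(picks)  # one list of expert indices per layer, for this single token
```

The printed list has one entry per layer, and the chosen indices are free to differ from layer to layer, since each layer's router scores the token independently. That is the sense in which the selection is per token and per layer rather than per topic.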
|
I think where MoE is misleading is that the experts aren't what we would call "experts" in the normal world; rather, they are experts for a specific token. That concept feels difficult to grasp.