by kmacdough
471 days ago
I understand the principles of MOE, but clearly not enough to make full sense of this. Does each expert within R1 have 37B parameters? If so, is QwQ only truly competing against one expert in this particular benchmark? Generally I don't think I follow how MOE "selects" a model during training or usage.
Instead, the mixture of experts exists within individual layers. Suppose we want to have a big feed-forward layer that takes as input a 1024-element vector, has a hidden size of 8192, and an output size of 1024. We carve up that 8192-wide hidden layer into eight 1024-sized chunks (the chunk size does not have to match the input size). Whenever an input arrives at this layer, a routing function determines which of those 1024-sized chunks should serve as the hidden layer. Every token within a single prompt/response can choose a different chunk when it is processed by this layer, and every layer makes its own routing decision. So if I have 100 layers, each of which has 8 experts, there are 8^100 possible different paths that an individual token could take through the network.
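To make the routing idea concrete, here is a rough sketch in plain Python of one such layer and of a token's path through a stack of them. Everything in it is an assumption for illustration: the names (MoELayer, trace_path, count_paths), the tiny dimensions standing in for 1024, the linear router, the ReLU, and the pick-one-chunk (top-1) rule. Real MoE models often send each token to several experts per layer and train the router jointly with the rest of the network, so treat this as a toy, not as how R1 does it.

```python
import math
import random

D_MODEL = 16    # stands in for the 1024-element input/output vector
N_EXPERTS = 8   # number of chunks the hidden layer is carved into
D_CHUNK = 16    # hidden units per chunk (1024 in the description above)


def rand_matrix(rows, cols, rng):
    """Random rows x cols matrix as a list of lists (untrained weights)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class MoELayer:
    """One feed-forward layer whose hidden units are split into chunks."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Assumed router: one linear score per expert.
        self.router = rand_matrix(N_EXPERTS, D_MODEL, rng)
        # Each expert owns its own input and output projection.
        self.w_in = [rand_matrix(D_CHUNK, D_MODEL, rng) for _ in range(N_EXPERTS)]
        self.w_out = [rand_matrix(D_MODEL, D_CHUNK, rng) for _ in range(N_EXPERTS)]

    def route(self, x):
        """Return the index of the highest-scoring chunk for this token."""
        scores = matvec(self.router, x)
        return max(range(N_EXPERTS), key=scores.__getitem__)

    def forward(self, x):
        """Run the token through only the chosen chunk."""
        k = self.route(x)
        hidden = [max(0.0, h) for h in matvec(self.w_in[k], x)]  # ReLU
        return matvec(self.w_out[k], hidden), k


def trace_path(x, layers):
    """Record which chunk the token picks at every layer."""
    path = []
    for layer in layers:
        x, k = layer.forward(x)
        path.append(k)
    return path


def count_paths(n_layers, n_experts):
    """Number of distinct chunk sequences a single token could follow."""
    return n_experts ** n_layers


if __name__ == "__main__":
    rng = random.Random(42)
    layers = [MoELayer(seed=i) for i in range(4)]
    for token_id in range(3):
        token = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
        print(token_id, trace_path(token, layers))
    print(count_paths(100, 8) == 2 ** 300)
```

The point of the sketch is that only one chunk's weights are touched per token per layer, so the compute per token is a fraction of the total parameter count, and different tokens in the same prompt can light up different chunks. With 100 layers and 8 chunks each, count_paths gives 8^100, which is the same number as 2^300.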