by Kubuxu
905 days ago
It is two out of eight experts at each layer, and the 32 layers route independently of each other. There are no eight "sub-models". However, this raises a question: could a slightly more complex router use the output of layer n-1 to choose the experts for layer n+1 (today the output of layer n chooses the experts for layer n+1)?
This way, there would be more time to load the experts needed for layer n+1.
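To make the question concrete, here is a toy sketch of the two routing schedules. Everything in it is an assumption for illustration: the gate weights and layer inputs are random stand-ins, `top_k_experts` and `route` are hypothetical names, and this is not how any real model implements its router. A real look-ahead router would presumably have to be trained to predict from the earlier state.

```python
import random

NUM_LAYERS, NUM_EXPERTS, TOP_K, DIM = 32, 8, 2, 4  # DIM is a toy size

def top_k_experts(gate, hidden, k=TOP_K):
    """Score each expert with a dot product and return the k best indices."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in gate]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]

def route(gates, layer_inputs, lookahead=0):
    """Choose experts for every layer.

    layer_inputs[i] is the input to layer i, i.e. the output of layer i-1.
    lookahead=0: layer i is routed from its own input (the usual scheme).
    lookahead=1: layer i is routed from the input of layer i-1, so the
    choice is known one layer earlier (hypothetical scheme).
    """
    choices = []
    for layer, gate in enumerate(gates):
        source = max(layer - lookahead, 0)  # first layers have nothing earlier
        choices.append(top_k_experts(gate, layer_inputs[source]))
    return choices

# Random stand-ins for per-layer gate matrices and per-layer inputs.
rng = random.Random(0)
gates = [[[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
         for _ in range(NUM_LAYERS)]
layer_inputs = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LAYERS)]

today = route(gates, layer_inputs)               # decided when the layer starts
early = route(gates, layer_inputs, lookahead=1)  # decided one layer ahead
```

With `lookahead=1` the expert indices for a layer are available while the previous layer is still computing, which is the window that could be used to prefetch weights. Whether the earlier state predicts the right experts well enough is exactly the open question.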