It is two out of eight experts at each layer, and the 32 layers each make that choice independently of one another. There are not eight separate "sub-models".
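To make the per-layer choice concrete, here is a rough toy sketch of top-2-of-8 routing, not the actual model code. The names `make_router` and `route`, the tiny hidden size, the random weights, and feeding the same vector to every layer are all assumptions for illustration; in a real model each layer sees a different hidden state.

```python
import math
import random

NUM_LAYERS = 32   # from the comment above
NUM_EXPERTS = 8   # from the comment above
TOP_K = 2         # from the comment above
HIDDEN = 16       # toy size, an assumption for this sketch

def make_router(rng):
    # Hypothetical gating matrix: one weight row per expert.
    return [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
            for _ in range(NUM_EXPERTS)]

def route(router, hidden_state):
    # Score every expert, keep the top-k, softmax over the kept scores only.
    logits = [sum(w * x for w, x in zip(row, hidden_state)) for row in router]
    top = sorted(range(NUM_EXPERTS), key=lambda e: logits[e], reverse=True)[:TOP_K]
    peak = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, v / total) for e, v in zip(top, exps)]

rng = random.Random(0)
# Each layer owns its own router, so the picks are independent per layer.
routers = [make_router(rng) for _ in range(NUM_LAYERS)]
token = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
choices = [route(r, token) for r in routers]
```

Each entry of `choices` is a pair of (expert index, mixing weight) tuples for one layer. Because every layer picks its own pair, a single token can pass through many different expert combinations on its way through the stack, which is why thinking of it as eight standalone models is misleading.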

However, this raises a question: could a slightly more complex router use the output of layer n-1 to choose the experts for layer n+1, instead of the output of layer n as is done today? That way, the experts for layer n+1 could be loaded while layer n is still computing.
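The timing side of that idea can be sketched as a small schedule. This is only an illustration of the question, not something the model does: `plan_router_inputs` and its `lookahead` parameter are hypothetical names, and whether routing quality would hold up with an older input is left open.

```python
def plan_router_inputs(num_layers, lookahead):
    """Map each layer to the earlier layer whose output its router would read.

    lookahead=1 is the scheme described above as "today": layer n+1 is
    routed from the output of layer n. lookahead=2 is the proposed
    variant: layer n+1 is routed from the output of layer n-1.
    """
    if lookahead < 1:
        raise ValueError("lookahead must be at least 1")
    # The first `lookahead` layers have no early enough output to read,
    # so they are left out and would need ordinary routing.
    return {target: target - lookahead
            for target in range(lookahead, num_layers)}

today = plan_router_inputs(32, lookahead=1)
proposed = plan_router_inputs(32, lookahead=2)
```

With `lookahead=2`, the choice for layer 5 is known as soon as layer 3 finishes, so the whole of layer 4's compute is available for loading those experts.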