| HN Mirror



	by Kubuxu 905 days ago
	It is two out of eight experts at each layer, and the 32 layers route independently of each other, so there are no eight "sub-models". This does raise a question, though: could a slightly more complex router use the output of layer n-1 to choose the experts for layer n+1 (today the output of layer n chooses them)? That would leave more time to load the experts needed for layer n+1.
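
	To make the idea concrete, here is a rough toy sketch in plain Python, not anyone's real implementation. The 32 layers, 8 experts and top-2 numbers come from the comment above; everything else (DIM, top_k_experts, make_gates, toy_layer, run, the random gate weights) is made up for illustration. With lookahead=1 each layer's experts are picked from a hidden state that exists one layer earlier, which is the window you would use to prefetch them. Whether a router trained this way would still route well is a separate question the sketch does not answer.

```python
import random

NUM_LAYERS = 32   # from the comment
NUM_EXPERTS = 8   # from the comment
TOP_K = 2         # from the comment
DIM = 4           # hypothetical toy hidden size


def top_k_experts(hidden, gate, k=TOP_K):
    """Return the (sorted) indices of the k experts with the highest gate score."""
    scores = [sum(w * x for w, x in zip(row, hidden)) for row in gate]
    best = sorted(range(len(scores)), key=lambda e: -scores[e])[:k]
    return sorted(best)


def make_gates(rng):
    """One independent gate matrix (NUM_EXPERTS x DIM) per layer."""
    return [[[rng.uniform(-1, 1) for _ in range(DIM)]
             for _ in range(NUM_EXPERTS)]
            for _ in range(NUM_LAYERS)]


def toy_layer(hidden, experts, layer):
    """Stand-in for attention plus the chosen expert FFNs (not a real model)."""
    shift = sum(experts) + layer
    return [((x + 0.1 * (i + shift)) % 2.0) - 1.0 for i, x in enumerate(hidden)]


def run(hidden, gates, lookahead):
    """Route one token through all layers and return the experts chosen per layer.

    lookahead=0: a layer's experts are picked from that layer's own input.
    lookahead=1: they are picked from the previous layer's input, so the
                 choice is known one layer earlier and could be prefetched.
    """
    inputs = [hidden]           # inputs[n] is the hidden state entering layer n
    choices = []
    for n in range(NUM_LAYERS):
        source = inputs[max(0, n - lookahead)]
        experts = top_k_experts(source, gates[n])
        choices.append(experts)
        inputs.append(toy_layer(inputs[n], experts, n))
    return choices


if __name__ == "__main__":
    gates = make_gates(random.Random(0))
    token = [0.3, -0.2, 0.5, 0.1]
    for lookahead in (0, 1):
        print(lookahead, run(token, gates, lookahead)[:4])
```

	Note that every layer has its own gate and makes its own top-2 pick, so a token's path is a sequence of 32 independent choices rather than a choice of one of eight models.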