I understand that for each token Mixtral only needs two (of eight) submodels. I wonder whether there is temporal locality, so that an LRU caching scheme could be used.
It is two out of eight at each layer, and the 32 layers route independently of each other. There are not eight separate "sub-models".
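To make the caching idea concrete: since routing happens per layer, a cache would have to be keyed by (layer, expert) rather than by expert alone. Here is a minimal sketch of such an LRU cache. The class name, `load_fn`, and the fake loader are hypothetical and only for illustration; this is not how any Mixtral runtime is known to do it, and whether the hit rate is any good depends on how much temporal locality the router actually shows.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """LRU cache of expert weights keyed by (layer, expert). Hypothetical sketch."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity      # how many experts fit in fast memory
        self.load_fn = load_fn        # assumed slow loader, e.g. from CPU RAM or disk
        self._slots = OrderedDict()   # oldest entry first, most recent last
        self.hits = 0
        self.misses = 0

    def get(self, layer, expert):
        key = (layer, expert)
        if key in self._slots:
            # Hit: mark as most recently used.
            self._slots.move_to_end(key)
            self.hits += 1
            return self._slots[key]
        # Miss: load, insert, and evict the least recently used entry if full.
        self.misses += 1
        weights = self.load_fn(layer, expert)
        self._slots[key] = weights
        if len(self._slots) > self.capacity:
            self._slots.popitem(last=False)
        return weights

    def __contains__(self, key):
        return key in self._slots


def fake_load(layer, expert):
    # Stand-in for an expensive weight transfer.
    return f"weights[{layer}][{expert}]"


cache = ExpertLRUCache(capacity=2, load_fn=fake_load)
cache.get(0, 3)   # miss
cache.get(0, 5)   # miss
cache.get(0, 3)   # hit, (0, 3) becomes most recent
cache.get(1, 3)   # miss, evicts (0, 5)
```

Replaying a recorded trace of router decisions through `get` and reading `hits` and `misses` would be one way to check whether the locality is really there.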
However, this raises a question: could a slightly more complex router use the output of layer n-1 to choose the experts for layer n+1 (today, the output of layer n chooses the experts for layer n+1)?
That would leave more time to load the experts needed for layer n+1.
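A toy sketch of what that would look like, assuming a router that is a linear gate followed by top-2 selection. The gate weights, hidden states, and function names below are made up for illustration. The selection function is the same in both cases; the only difference is which layer's output is fed into it. How often the early guess matches the real choice is an empirical question that this sketch does not answer.

```python
def router_logits(hidden, gate):
    """One logit per expert: dot product of each gate row with the hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in gate]


def top2(logits):
    """Indices of the two largest logits, best first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:2]


def prefetch_hit_rate(predicted, actual):
    """Fraction of the experts actually used that were already prefetched."""
    return len(set(predicted) & set(actual)) / len(actual)


# Hypothetical gate for layer n+1: 4 experts, hidden size 2.
gate_next = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

hidden_prev = [0.9, 0.1]   # made-up output of layer n-1 (available early)
hidden_curr = [0.2, 0.8]   # made-up output of layer n (what the router sees today)

# Lookahead: guess from the older hidden state and start loading right away.
predicted = top2(router_logits(hidden_prev, gate_next))
# Today's behaviour: decide only once layer n has finished.
actual = top2(router_logits(hidden_curr, gate_next))

rate = prefetch_hit_rate(predicted, actual)
```

With these made-up numbers the early guess gets one of the two experts right, so one load would still land on the critical path. A lookahead router trained for the purpose could presumably do better than reusing the existing gate, but that is speculation.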