by joshhart 905 days ago
If you are making many requests in a batch, this works OK because you can shuffle the next layer's weights in while the current layer is processing a set of matrix multiplies. This turns it from a memory-bound problem into a FLOPs-bound one. It really only works if you care about throughput and not latency.
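A back-of-envelope sketch of where that crossover sits, assuming a dense layer in which every token in the batch touches every parameter (roughly 2 FLOPs per parameter per token). The function name and all hardware numbers are hypothetical placeholders, not measurements.

```python
def break_even_batch_tokens(bytes_per_param, peak_flops, transfer_bytes_per_s):
    """Batch size (in tokens) at which computing a layer takes as long as loading it.

    Load time for P parameters:  P * bytes_per_param / transfer_bytes_per_s
    Compute time for B tokens:   2 * P * B / peak_flops
    Setting the two equal, P cancels out.
    """
    return bytes_per_param * peak_flops / (2 * transfer_bytes_per_s)


# Hypothetical hardware: fp16 weights, 100 TFLOP/s of compute, 25 GB/s transfer link.
tokens = break_even_batch_tokens(
    bytes_per_param=2, peak_flops=100e12, transfer_bytes_per_s=25e9
)
print(f"break-even batch: about {tokens:.0f} tokens")
```

Under these assumptions, a batch smaller than the break-even size finishes its compute before the next layer has arrived, so the transfer cannot be fully hidden.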

I understand that for each token Mixtral will only need two (of eight) submodels. I wonder if there is temporal locality and whether an LRU caching scheme could be used.
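For what the LRU idea might look like, here is a minimal sketch that only tracks which expert weights are resident, keyed by a (layer, expert) pair. The class name, the capacity, and the access pattern are made up for illustration; whether real routing has enough temporal locality for this to pay off is exactly the open question.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Tracks resident (layer, expert) weights and evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # key -> placeholder for loaded weights
        self.hits = 0
        self.misses = 0

    def access(self, layer, expert):
        """Record a use of one expert; return True on a cache hit."""
        key = (layer, expert)
        if key in self.resident:
            self.resident.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.resident[key] = True  # stands in for loading the weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Made-up access pattern: room for two experts, four accesses at layer 0.
cache = ExpertLRUCache(capacity=2)
for expert in (1, 3, 1, 5):
    cache.access(0, expert)
print(list(cache.resident), cache.hit_rate())
```

Feeding it a trace of real router decisions would show directly how the hit rate varies with capacity.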
It is two out of eight experts at each layer, and the 32 layers choose independently of each other. There are no eight "sub-models".
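To make the per-layer picture concrete, a small sketch of top-2 routing for one token at one layer: keep the two largest router logits and softmax over just those two. The function name and the logits are invented for illustration, and this is not taken from any actual Mixtral code.

```python
import math


def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    order = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    top = order[:2]
    # Softmax over the two selected logits only (shifted by the max for stability).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Invented router logits for one token at two different layers (8 experts each).
layer_logits = {
    0: [0.0, 1.0, 3.0, -1.0, 2.0, 0.5, 0.1, -2.0],
    1: [2.5, -0.3, 0.0, 0.4, -1.0, 0.2, 1.5, 0.9],
}
for layer, logits in layer_logits.items():
    print(layer, top2_route(logits))
```

Each layer runs its own router, so the same token can land on experts 2 and 4 at one layer and on experts 0 and 6 at the next.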

However, this raises a question: could a slightly more complex router use the output of layer n-1 to choose experts for layer n+1 (versus the output of layer n today)? This way, there is more time to load the needed experts for layer n+1.