


	by joshhart 905 days ago
	If you are making many requests in a batch, this works reasonably well, because you can load the next layer's weights while the current layer is working through its matrix multiplies for the whole batch. That turns a memory-bound problem into a compute-bound (FLOPs-bound) one. It only really helps if you care about throughput rather than latency.
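A rough sketch of the arithmetic behind this, not a benchmark. The hardware numbers and the helper names (`layer_times`, `is_compute_bound`, `HW`) are made up for illustration, and the cost model assumes about 2 FLOPs per parameter per token for a dense layer:

```python
def layer_times(batch_tokens, params_per_layer, bytes_per_param,
                link_bytes_per_s, device_flops_per_s):
    """Return (transfer_s, compute_s) for one layer.

    transfer_s: time to copy the layer's weights over the link.
    compute_s:  time to run the layer for the whole batch, assuming
                roughly 2 FLOPs per parameter per token.
    """
    transfer_s = params_per_layer * bytes_per_param / link_bytes_per_s
    compute_s = 2 * params_per_layer * batch_tokens / device_flops_per_s
    return transfer_s, compute_s


def is_compute_bound(batch_tokens, **hw):
    """True when computing a layer takes at least as long as loading the next."""
    transfer_s, compute_s = layer_times(batch_tokens, **hw)
    return compute_s >= transfer_s


# Hypothetical hardware, for illustration only.
HW = dict(
    params_per_layer=1_000_000_000,
    bytes_per_param=2,           # 16-bit weights
    link_bytes_per_s=25e9,       # assumed host-to-device bandwidth
    device_flops_per_s=100e12,   # assumed sustained throughput
)

if __name__ == "__main__":
    for batch in (1, 64, 8192):
        t, c = layer_times(batch, **HW)
        print(f"batch={batch}: transfer={t:.5f}s compute={c:.5f}s "
              f"compute_bound={c >= t}")
```

With these assumed numbers the crossover is at roughly 4000 tokens per batch: below that the link is the bottleneck, above it the weight transfer hides behind compute. A single request still waits for every transfer, which is why latency does not improve.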

1 comment

gpderetta 905 days ago

I understand that for each token Mixtral only needs two (of eight) submodels. I wonder whether there is temporal locality, so that an LRU caching scheme could be used.
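A minimal sketch of what such an LRU cache could look like, assuming expert weights are fetched by some slow loader. `ExpertLRUCache` and `load_fn` are hypothetical names, and whether real routing has enough temporal locality for the hit rate to be useful is exactly the open question:

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Keep the most recently used expert weights in fast memory."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self._load_fn = load_fn      # hypothetical slow loader (disk or host RAM)
        self._cache = OrderedDict()  # key -> weights, oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            # Hit: mark the entry as most recently used.
            self._cache.move_to_end(key)
            self.hits += 1
            return self._cache[key]
        # Miss: load the weights, then evict the least recently used entry
        # if the cache has grown past its capacity.
        self.misses += 1
        weights = self._load_fn(key)
        self._cache[key] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)
        return weights


if __name__ == "__main__":
    cache = ExpertLRUCache(capacity=2, load_fn=lambda key: f"weights for {key}")
    for key in ["a", "b", "a", "c", "b"]:
        cache.get(key)
    print(cache.hits, cache.misses)
```

Keys can be any hashable value, for example an expert id or a (layer, expert) pair.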


Kubuxu 905 days ago

It is two out of eight at each layer, with 32 layers independent of each other. There are no eight "sub-models".
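To make the distinction concrete, here is a toy sketch of per-layer top-2 routing. Random logits stand in for a learned router, and `top_k_experts` and `route_token` are illustrative names, not Mixtral's actual code:

```python
import random

NUM_LAYERS = 32   # layers, as described above
NUM_EXPERTS = 8   # experts per layer
TOP_K = 2         # experts used per token at each layer


def top_k_experts(router_logits, k=TOP_K):
    """Indices of the k largest logits, best first."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return order[:k]


def route_token(rng):
    """Pick experts for one token, independently at every layer.

    Random logits are a stand-in for the learned router's output.
    """
    path = []
    for _ in range(NUM_LAYERS):
        logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
        path.append(top_k_experts(logits))
    return path


if __name__ == "__main__":
    path = route_token(random.Random(0))
    for layer, experts in enumerate(path[:4]):
        print(f"layer {layer}: experts {experts}")
```

Because the choice is made again at every layer, a token's path is a sequence of 32 pairs rather than a pick of two whole models, so any cache would need to key on the (layer, expert) pair.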

However, this raises a question: could a slightly more complex router use the output of layer n-1 to choose the experts for layer n+1 (versus the output of layer n choosing them for layer n+1 today)? That way, there would be more time to load the experts needed for layer n+1.
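A back-of-the-envelope timing model for that idea, under loud assumptions: every layer's experts miss the cache and must be loaded, loads and compute can overlap perfectly, and the earlier routing decision is taken as given (whether it would be as accurate is not established here). The function names are hypothetical:

```python
def latency_without_lookahead(compute_s, load_s, num_layers):
    """Experts for a layer are known only once the previous layer finishes,
    so each load sits on the critical path before that layer's compute."""
    return num_layers * (load_s + compute_s)


def latency_with_one_layer_lookahead(compute_s, load_s, num_layers):
    """Experts for layer n+1 are chosen from the output of layer n-1,
    so their load can overlap with the compute of layer n.

    The first layer's load cannot be hidden; after that each step costs
    whichever of compute or load is slower.
    """
    return load_s + compute_s + (num_layers - 1) * max(compute_s, load_s)


if __name__ == "__main__":
    # Made-up per-layer timings in milliseconds, for illustration only.
    compute_ms, load_ms, layers = 2.0, 3.0, 32
    print(latency_without_lookahead(compute_ms, load_ms, layers))
    print(latency_with_one_layer_lookahead(compute_ms, load_ms, layers))
```

In this model the lookahead hides the load completely only when loading one layer's experts is no slower than computing a layer. Otherwise the link stays the bottleneck, just with less idle time.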
