by idiliv 806 days ago
In Mixtral 8x7B, the 8 means that the model uses Mixture-of-Experts (MoE) layers with 8 experts. The 7B means that if you removed 7 of the 8 experts in each layer, you would end up with a 7B model (with exactly the same architecture as Mistral 7B). So a 1x7B model has 7B params, and an 8x7B model has 1 * 7B + (8-1) * sz_expert params, where sz_expert is the constant number of params that the MoE layers gain for each added expert. In the case of Mixtral 8x7B the model size is 46.3GB, so sz_expert = (46.3 - 7) / 7 ≈ 5.6B.
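
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `expert_size_b` is made up here, and the sketch takes the comment's figures (46.3 total and 7 base, both read as billions of params) at face value.

```python
def expert_size_b(total_b: float, base_b: float, n_experts: int) -> float:
    """Solve total = base + (n_experts - 1) * sz_expert for sz_expert.

    All sizes are in billions of params. Hypothetical helper, not from any library.
    """
    if n_experts < 2:
        raise ValueError("need at least two experts to isolate sz_expert")
    return (total_b - base_b) / (n_experts - 1)


# Figures as quoted in the comment (assumed to be billions of params).
mixtral_8x7b = expert_size_b(total_b=46.3, base_b=7.0, n_experts=8)
print(f"sz_expert ≈ {mixtral_8x7b:.2f}B")  # prints: sz_expert ≈ 5.61B
```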

If these assumptions carry over to 8x22B, then at 281GB, 8x22B has sz_expert ≈ 13.8B.

2 comments

I tried to check this for myself.

I agree on the first one: (46.3 - 7) / 7 ≈ 5.61B.

The second one doesn't match up: (281 - 22) / 7 = 37B, or (140.5 - 22) / 7 ≈ 16.92B. Am I doing something wrong?

Just tried this again and I also arrive at 16.92B. Not sure what I did wrong the first time, thanks for double-checking this!
Oh, and to answer your actual question: Assuming that the model is released with 16 bits (2 bytes) per parameter, then it has 281GB / 2 bytes per parameter = 140.5B parameters.
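
For anyone who wants to re-run the numbers, here is a rough Python sketch of the conversion and the resulting expert size. The helper name `params_in_billions` is made up for illustration, and it assumes 1 GB = 1e9 bytes and 16 bits per parameter, as in the comment above.

```python
def params_in_billions(size_gb: float, bits_per_param: int = 16) -> float:
    """Convert a checkpoint size in GB to billions of params.

    Assumes 1 GB = 1e9 bytes, so GB / (bytes per param) gives billions of params.
    Hypothetical helper, not from any library.
    """
    bytes_per_param = bits_per_param / 8
    return size_gb / bytes_per_param


total_b = params_in_billions(281)        # 281 GB at 2 bytes per param
sz_expert_b = (total_b - 22) / (8 - 1)   # same formula as for 8x7B
print(f"{total_b:.1f}B params, sz_expert ≈ {sz_expert_b:.2f}B")
# prints: 140.5B params, sz_expert ≈ 16.93B
```

The exact value is 118.5 / 7 = 16.9286, which rounds to 16.93; the 16.92 quoted above is the same number truncated.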