Hacker News
by cjbprime 798 days ago
It's a very plausible rumor, but it is misleading in this context, because the rumor also states that it's a mixture of experts model with 8 experts, suggesting that most (perhaps as many as 7/8) of those weights are unused by any particular inference pass.

That might suggest that GPT-4 should be thought of as something like a 250B model. But the 1/8 of the weights that do get used are the ones the router selected as the "most useful" for that pass, so 250B feels like it undercounts the effective parameter size, whereas 1.8T overcounts it.

I think it's not really well defined how to compare parameter counts with an MoE model.
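
A back-of-the-envelope sketch of the two extremes discussed above. It assumes (unrealistically) that every one of the rumored 1.8T weights sits inside one of 8 equally sized experts and that nothing is shared between them; the function name and the choice of experts-per-token values are purely illustrative.

```python
def naive_active_params(total_params: float, num_experts: int, experts_per_token: int) -> float:
    """Weights touched per token if ALL weights live in equally sized experts (an assumption)."""
    return total_params * experts_per_token / num_experts


RUMORED_TOTAL = 1.8e12  # rumored figure from this thread, not confirmed
NUM_EXPERTS = 8         # also part of the rumor

for k in (1, 2):
    active = naive_active_params(RUMORED_TOTAL, NUM_EXPERTS, k)
    print(f"top-{k}: {active / 1e9:.0f}B active per token")
```

Under these assumptions one expert per token works out to 225B and two experts per token to 450B. Real MoE models share attention and embedding weights across experts, so the true active fraction would be somewhat higher than this naive split.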

3 comments

But from an output quality standpoint the total parameter count still seems more relevant. For example, 8x7B Mixtral only executes 13B parameters per token, but it behaves comparably to 34B and 70B models, which tracks with its total size of ~45B parameters. You get some of the training and inference advantages of a 13B model, with the strength of a 45B model.
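
For what it's worth, the 13B-active versus ~45B-total split can be roughly reproduced by counting weights. The sketch below assumes the shapes from the publicly released Mixtral 8x7B config (hidden size 4096, FFN size 14336, 32 layers, 8 experts with 2 active per token, 32000-token vocabulary, 8 KV heads of dimension 128). None of those numbers come from this thread, so treat them as assumptions.

```python
# Assumed Mixtral 8x7B shapes (from the public config, not from this thread).
HIDDEN = 4096
FFN = 14336
LAYERS = 32
EXPERTS = 8
EXPERTS_PER_TOKEN = 2
VOCAB = 32000
KV_HEADS = 8
HEAD_DIM = 128

# Attention: full-size q and o projections, smaller k and v (grouped query attention).
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * HEAD_DIM
# One expert is a gated MLP with three HIDDEN x FFN matrices.
one_expert = 3 * HIDDEN * FFN
router = HIDDEN * EXPERTS
norms = 2 * HIDDEN

# Weights that exist regardless of routing: embeddings, output head, final norm.
shared = 2 * VOCAB * HIDDEN + HIDDEN

total = LAYERS * (attention + EXPERTS * one_expert + router + norms) + shared
active = LAYERS * (attention + EXPERTS_PER_TOKEN * one_expert + router + norms) + shared

print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

Under these assumptions the count comes to about 46.7B total and about 12.9B active per token. It also shows why "8x7B" is not 56B: only the MLP blocks are replicated per expert, while attention and embeddings are shared.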

Similarly, if GPT-4 is really 1.8T you would expect it to produce output of similar quality to a comparable 1.8T model without MoE architecture.

"For example, 8x7B Mixtral only executes 13B parameters per token, but it behaves comparably to 34B and 70B models"

Are you sure about that? I'm pretty sure Miqu (the leaked Mistral 70B model) is generally thought to be smarter than Mixtral 8x7B.

What is the reason for settling on 7/8 experts for mixture of experts? Has there been any serious evaluation of what would be a good MoE split?

It's not always 7-8.

From Databricks: "DBRX has 16 experts and chooses 4, while Mixtral and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments."
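
The "65x" in that quote is just a ratio of binomial coefficients, which is easy to check. A minimal sketch (the variable names are mine, not Databricks'):

```python
from math import comb

# Number of distinct expert subsets a router can pick per token.
dbrx_combos = comb(16, 4)    # 16 experts, choose 4
mixtral_combos = comb(8, 2)  # 8 experts, choose 2

ratio = dbrx_combos // mixtral_combos
print(dbrx_combos, mixtral_combos, ratio)  # 1820 28 65
```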

https://www.databricks.com/blog/introducing-dbrx-new-state-a...

A 19" server chassis is wide enough for 8 vertically mounted GPUs next to each other, with just enough space left for the power supplies. Consequently 8 GPUs is a common and cost efficient configuration in servers.

Everyone seems to put each expert on a different GPU in training and inference, so that's how you get to 8 experts, or 7 if you want to put the router on its own GPU too.

You could also do multiples of 8. But from my limited understanding it seems like more experts don't perform better. The main advantage of MoE is the ability to split the model into parts that don't talk to each other, and run these parts on different GPUs or different machines.

(For a model of GPT-4's size, it could also be 8 nodes with several GPUs each, each node comprising a single expert.)

I think it's almost certainly using at least two experts per token. It helps a lot during training to have two experts to contrast when putting losses on the expert router.
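
To make the "two experts per token" idea concrete, here is a minimal pure-Python sketch of top-k gating: softmax over the router scores, keep the k best experts, and renormalize their weights. This is one common formulation, not a claim about how GPT-4 or any specific model does it; the function name and the example logits are made up for illustration.

```python
import math


def top_k_gate(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    denom = sum(exps)
    probs = [e / denom for e in exps]
    # Keep the k most probable experts and renormalize their weights to sum to 1.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# Hypothetical router scores for one token across 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
gates = top_k_gate(logits, k=2)
for index, weight in gates:
    print(f"expert {index}: weight {weight:.3f}")
```

In this particular formulation, k=1 always yields a renormalized weight of exactly 1.0, so the mixing weight says nothing about how strongly the router preferred that expert. With k=2 the two weights encode a relative preference between the chosen experts, which is one way to read the "two experts to contrast" point above.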