| HN Mirror



	by worldsayshi 797 days ago
	What is the reason for settling on 7/8 experts for mixture of experts? Has there been any serious evaluation of what would be a good MoE split?

2 comments

nycdatasci 797 days ago

It's not always 7-8.

From Databricks: "DBRX has 16 experts and chooses 4, while Mixtral and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments."

https://www.databricks.com/blog/introducing-dbrx-new-state-a...
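The "65x" figure in the quote can be checked with a little arithmetic, assuming it counts unordered sets of active experts per token: choosing 4 of 16 gives C(16, 4) = 1820 sets and choosing 2 of 8 gives C(8, 2) = 28. A minimal sketch (the function name `expert_combinations` is made up for illustration, not from Databricks):

```python
from math import comb


def expert_combinations(num_experts: int, top_k: int) -> int:
    """Number of distinct unordered sets of top_k experts out of num_experts."""
    return comb(num_experts, top_k)


dbrx = expert_combinations(16, 4)    # 16 experts, 4 active per token
mixtral = expert_combinations(8, 2)  # 8 experts, 2 active per token
ratio = dbrx // mixtral

print(dbrx, mixtral, ratio)  # 1820 28 65
```

Note that both configurations activate the same fraction of experts per token (4/16 = 2/8); the difference is in how finely the routing can mix them.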


wongarsu 796 days ago

A 19" server chassis is wide enough for 8 vertically mounted GPUs side by side, with just enough space left for the power supplies. Consequently, 8 GPUs is a common and cost-efficient server configuration.

Everyone seems to put each expert on a different GPU in training and inference, so that's how you get to 8 experts, or 7 if you want to put the router on its own GPU too.

You could also do multiples of 8. But from my limited understanding, more experts don't seem to perform better. The main advantage of MoE is the ability to split the model into parts that don't talk to each other, and to run these parts on different GPUs or different machines.
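To make the counting concrete, here is a toy sketch of the one-expert-per-GPU layout described above. It is only an illustration of the arithmetic, not how any real framework assigns devices; `place_experts` and `count_experts` are hypothetical helpers.

```python
def place_experts(num_gpus: int = 8, dedicated_router: bool = False) -> dict[str, int]:
    """Map each component name to a GPU index, with one expert per GPU.

    If dedicated_router is True, GPU 0 is reserved for the router and the
    experts fill the remaining GPUs.
    """
    placement: dict[str, int] = {}
    first_expert_gpu = 0
    if dedicated_router:
        placement["router"] = 0
        first_expert_gpu = 1
    for i, gpu in enumerate(range(first_expert_gpu, num_gpus)):
        placement[f"expert_{i}"] = gpu
    return placement


def count_experts(placement: dict[str, int]) -> int:
    """Count how many entries in the placement are experts."""
    return sum(1 for name in placement if name.startswith("expert_"))


print(count_experts(place_experts(8)))                         # 8
print(count_experts(place_experts(8, dedicated_router=True)))  # 7
```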


cjbprime 796 days ago

(For a model of GPT-4's size, it could also be 8 nodes with several GPUs each, each node comprising a single expert.)
