| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by phree_radical 806 days ago
	Very incorrect! The "8x7b" in the name regularly confuses people into some similar conclusion, but there are not eight 7b "experts" in Mixtral 8x. It's more apt to think of all 256 FFN's as the "experts," as each expert FFN on a given layer has no relation to the expert FFN's on other layers. You need to train them all within the MoE architecture, as combining existing models ("clown car MoE") works, but isn't gaining anything from the architecture/sparsity

1 comments

epups 806 days ago

Sorry, could you expand on this a bit further? Are you saying that for a MoE, you want to train the exact same model, and then just finetune the feed forward networks differently for each of them? And you're saying that separately training 8 different models would not be efficient - do we have evidence for that?

link