Hacker News
by phree_radical 806 days ago
Very incorrect! The "8x7b" in the name regularly leads people to a similar conclusion, but there are not eight 7B "experts" in Mixtral 8x7B. It's more apt to think of all 256 FFNs as the "experts," since each expert FFN on a given layer has no relation to the expert FFNs on other layers. You need to train them all together within the MoE architecture; combining existing models ("clown car MoE") works, but it doesn't gain anything from the architecture or its sparsity.
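To make the "256 FFNs" count concrete, here is a rough sketch rather than Mixtral's actual code. It assumes 32 layers with 8 expert FFNs each and top-2 routing per token, and it uses random scores as a stand-in for the learned routers. The names `total_expert_ffns` and `route_token` are made up for illustration.

```python
import random

NUM_LAYERS = 32        # assumption: layer count for Mixtral 8x7B
EXPERTS_PER_LAYER = 8  # the "8x" in the name
TOP_K = 2              # assumption: experts activated per token per layer


def total_expert_ffns(num_layers=NUM_LAYERS, experts_per_layer=EXPERTS_PER_LAYER):
    """Count every expert FFN in the model, across all layers."""
    return num_layers * experts_per_layer


def route_token(rng, num_layers=NUM_LAYERS,
                experts_per_layer=EXPERTS_PER_LAYER, top_k=TOP_K):
    """Pick top_k experts independently at each layer.

    Random scores stand in for a learned router (hypothetical). The point
    is only that the choice at one layer says nothing about the next.
    """
    path = []
    for _ in range(num_layers):
        scores = [rng.random() for _ in range(experts_per_layer)]
        ranked = sorted(range(experts_per_layer),
                        key=lambda e: scores[e], reverse=True)
        path.append(tuple(sorted(ranked[:top_k])))
    return path


rng = random.Random(0)
path = route_token(rng)

print(total_expert_ffns())  # 256
print(path[:4])             # expert pairs chosen at the first four layers
```

Expert index 3 on one layer and expert index 3 on another are unrelated FFNs, so a token's path is a fresh pair of experts at every layer, not a trip through one of eight standalone models.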
1 comment

Sorry, could you expand on this a bit further? Are you saying that for an MoE, you want to train the exact same model, and then just finetune the feed-forward networks differently for each of them? And you're saying that separately training 8 different models would not be efficient. Do we have evidence for that?