|
|
|
|
|
by swyx
713 days ago
|
|
thanks for the thoughtful qtn! yeah i dont have the background for this one, you'll have to ask an MoE researcher (which Yi is not really either as i found out on the pod). it does make sense that on a param-for-param basis MoEs would have less compositional abilities, but i have yet to read a paper (mostly my fault, bc i dont read MoE papers that closely, but also researcher fault, in that they're not incentivized to be rigorous about downsides of MoEs) that really identified what these compositional abilities are that MoEs are affected. if you could, for example, identify subcategories of BigBench or similar that require compositional abilities, then we might be able to get hard evidence on this question. i'm not yet motivated enough to do this myself but it'd make a decent small research question. HOWEVER i do opine that MoEs are kiiind of a stopgap (both on the pod and on https://latent.space/p/jan-2024) - definitely a validated efficiency/sparsity technique (esp see deepseek's moe work if you havent already, with >100 experts https://buttondown.email/ainews/archive/ainews-deepseek-v2-b...) but mostly a oneoff boost you get on the single small dense expert equivalent model rather than comparable to the capabilities of a large dense model of the same param count (aka I expect a 8x22B MoE to never outperform a 176B dense model ceteris paribus - which is difficult to get a like-for-like comparison on bc these things are expensive, also partially because usually the MoE is just upcycled instead of trained from scratch, and partially because the routing layer is deepening every month). so perhaps to TLDR there is more than enough evidence and practical motivation to justify their continued use (i would go so far as to say that all inference endpoints incl gpt4 and above should be MoEs) but they themselves are not really an architectural decision that matters for the next quantum leap in capabilities |
|