author here! thanks for submitting - if you don't mind, I will copy over some personal highlights from my Twitter post (https://x.com/latentspacepod/status/1809300018907828285):

Special interest topics we touched on:

- why encoder-decoder is not actually that different from decoder-only architecture (Reka is notably enc-dec vs GPT, which is dec-only)
- why the "Noam architecture" is All You Need
- the chaotic vs stable periods of spinning up GPU clusters (per this great post I submitted: https://news.ycombinator.com/item?id=39609997)
- the NeurIPS Mirage paper vs Jason/Yi's Emergent Abilities paper
- The Efficiency Misnomer - reminder of Arvind Narayanan's recent callout that "active params" isn't actually the same thing as lower cost, and also not the same thing as faster inference
- echoing Founders Fund's skepticism that Open Source AI can have any real lasting impact

please AMA! (Yi will see this)
The practical motivation for MoEs is very clear, but I do worry about losing compositional abilities (which I think just emerge from superposed representations?) that some tasks may require, especially with the many-experts trend we're seeing. This comes from an observation on smaller MoE models (with top-k gating etc.) that may or may not hold at scale: denser models trained to the same loss tend to perform complex tasks "better".
Intuitively, do you think MoEs are just another stopgap trick we're using while we figure out more compute and better optimizers, or could there be enough theoretical motivation to justify their continued use? If there isn't, perhaps we need to at least figure out "expert scaling laws" :)
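
For anyone unfamiliar with the top-k gating mentioned above, here is a minimal pure-Python sketch of one common variant: score every expert, keep the k highest, and renormalize their scores with a softmax. The names `top_k_gate` and `moe_forward`, the toy scalar "experts", and the router logits are all hypothetical and only there for illustration; real MoE layers use learned routers over vectors, and some apply the softmax before selecting the top k rather than after.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1 across the chosen experts.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return dict(zip(chosen, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the k selected experts only."""
    gate = top_k_gate(router_logits, k)
    return sum(w * experts[i](x) for i, w in gate.items())


# Toy experts: plain scalar functions standing in for expert MLPs.
experts = [
    lambda x: x + 1.0,
    lambda x: 2.0 * x,
    lambda x: -x,
    lambda x: x * x,
]
# Hypothetical router scores for a single token.
logits = [0.1, 2.0, -1.0, 2.0]

gate = top_k_gate(logits, k=2)
output = moe_forward(3.0, experts, logits, k=2)
```

With k=2, only two of the four experts ever see this token, which is the sense in which "active params" are a fraction of total params. The other two experts contribute nothing to the output, which is also where the worry about lost composition across experts comes from.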