by Imnimo 471 days ago
I wonder if having a big mixture of experts isn't all that valuable for the type of tasks in math and coding benchmarks. Like my intuition is that you need all the extra experts because models store fuzzy knowledge in their feed-forward layers, and having a lot of feed-forward weights lets you store a longer tail of knowledge. Math and coding benchmarks do sometimes require highly specialized knowledge, but if we believe the story that the experts specialize to their own domains, it might be that you only really need a few of them if all you're doing is math and coding. So you can get away with a non-mixture model that's basically just your math-and-coding experts glued together (which comes out to about 32B parameters in R1's case).
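To make the intuition above concrete, here is a minimal toy sketch of top-k routing. Everything in it is an assumption for illustration: the router scores, the parameter counts, and the helper names (`top_k_route`, `expert_usage`, `parameter_counts`) are made up, and real routers are learned per layer. It only shows the bookkeeping: if tokens from one domain keep hitting the same few experts, the "active" parameter count is much smaller than the total. Whether real models route this way is exactly the open question.

```python
from collections import Counter


def top_k_route(router_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


def expert_usage(batch_scores, k):
    """Count how often each expert is selected across a batch of tokens."""
    usage = Counter()
    for scores in batch_scores:
        usage.update(top_k_route(scores, k))
    return usage


def parameter_counts(n_experts, k, params_per_expert, shared_params):
    """Return (total, active-per-token) parameter counts for a toy MoE layer."""
    total = shared_params + n_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active


# Hypothetical router scores: three "math" tokens scored against 4 experts.
math_tokens = [
    [0.1, 2.0, 1.5, -0.3],
    [0.0, 1.8, 2.2, 0.1],
    [-0.5, 2.5, 1.9, 0.2],
]

usage = expert_usage(math_tokens, k=2)
used_experts = sorted(usage)  # experts that were ever selected

# Made-up sizes, in arbitrary units, purely to show the arithmetic.
total, active = parameter_counts(n_experts=4, k=2,
                                 params_per_expert=10, shared_params=5)
```

In this toy batch only two of the four experts are ever selected, so a dense model built from just those two (plus the shared weights) would cover the whole batch.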
2 comments

MoE is likely a temporary local optimum that, for now, resembles the bitter-lesson path. With time we'll likely distill what's important, shrink it, and keep it always active. There may be some dynamic retrieval of knowledge (but not intelligence) in the future, but it probably won't be anything close to MoE.
...let me expand a bit.

It would be interesting if research teams tried to collapse a trained MoE into a JoaT (Jack of all Trades - why not?).

With the MoE architecture it should be efficient to backpropagate through the other experts' layers so that they align with the output of the selected one, eventually turning multiple experts into multiple Jacks.
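One way to read that idea, as a rough sketch only: treat the selected expert as a fixed teacher and take gradient steps on the other experts so that their output matches it on the same input. The code below is a toy under loud assumptions, not an established method: each "expert" is a single scalar weight (`y = w * x`), the loss is squared error, and the function name `align_experts_to_selected` is hypothetical.

```python
def align_experts_to_selected(weights, selected, x, lr=0.1, steps=50):
    """Pull every non-selected expert toward the selected expert's output on x.

    Assumption: each expert is a toy scalar linear map y = w * x. The selected
    expert is held fixed as the teacher; every other expert j minimises
    (w_j * x - target)^2 by plain gradient descent.
    """
    weights = list(weights)  # work on a copy, leave the caller's list alone
    target = weights[selected] * x
    for _ in range(steps):
        for j in range(len(weights)):
            if j == selected:
                continue  # the teacher is not updated
            error = weights[j] * x - target
            grad = 2.0 * error * x  # d/dw_j of the squared error
            weights[j] -= lr * grad
    return weights


# Four hypothetical experts; the router picked expert 2 for this input.
experts = [0.5, -1.0, 2.0, 3.0]
aligned = align_experts_to_selected(experts, selected=2, x=1.0)
```

With a single scalar input the experts simply collapse onto the teacher's weight. In a real network the match would have to hold across many inputs, with a different teacher per token, which is where the interesting part would be.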

Having N Jacks at the end is interesting in itself, as you could try to exploit the commonalities present across completely different networks that produce the same results.

> but if we believe the story that the experts specialize to their own domains

I don't think we should believe anything like that.