by mirekrusin
473 days ago
MoE is likely a temporary local optimum, one that for now resembles the bitter-lesson path. With time we'll likely distill what's important, shrink it, and keep it always active. There may be some dynamic retrieval of knowledge (but not of intelligence) in the future, but it probably won't be anything close to MoE.

It would be interesting if research teams tried to collapse a trained MoE into a JoaT (Jack of all Trades, why not?).
With the MoE architecture it should be efficient to backpropagate through the other expert layers so that they align with the result of the selected one, in the end turning multiple experts into multiple Jacks.
Having N Jacks at the end is interesting in itself, as you could try to exploit the commonalities present across completely different networks that produce the same results.
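
To make the alignment idea a bit more concrete, here is a toy sketch, not a real MoE. Everything in it is an assumption for illustration: the experts are plain linear maps with scalar output, the router is a fixed top-1 gate, and the names (`route`, `align_step`, `disagreement`) are made up. For each input, the routed expert's output is treated as a fixed target and every other expert takes one gradient step on the squared error against it.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def route(gates, x):
    """Top-1 routing: index of the expert with the highest gate score."""
    scores = [dot(g, x) for g in gates]
    return max(range(len(scores)), key=scores.__getitem__)

def align_step(experts, gates, x, lr=0.05):
    """Pull every non-selected expert toward the selected expert's output on x."""
    k = route(gates, x)
    target = dot(experts[k], x)  # held constant: expert k gets no update here
    for j, w in enumerate(experts):
        if j == k:
            continue
        err = dot(w, x) - target
        # Gradient of 0.5 * err**2 with respect to w is err * x.
        experts[j] = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return k

def disagreement(experts, gates, inputs):
    """Mean squared gap between each non-selected expert and the routed one."""
    total, count = 0.0, 0
    for x in inputs:
        k = route(gates, x)
        target = dot(experts[k], x)
        for j, w in enumerate(experts):
            if j != k:
                total += (dot(w, x) - target) ** 2
                count += 1
    return total / count

rng = random.Random(0)
DIM, N_EXPERTS = 4, 3  # hypothetical toy sizes
experts = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
gates = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_EXPERTS)]
inputs = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(64)]

before = disagreement(experts, gates, inputs)
for _ in range(200):
    for x in inputs:
        align_step(experts, gates, x)
after = disagreement(experts, gates, inputs)
print(f"disagreement before: {before:.4f}, after: {after:.2e}")
```

One caveat: with linear experts the only way to agree everywhere is to end up with identical weights, so this toy can't show the more interesting case of different networks producing the same results. That would need nonlinear experts, where distinct weights can compute the same function.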