
...let me expand a bit.

It would be interesting if research teams tried to collapse a trained MoE (mixture of experts) into a JoaT (Jack of all Trades, why not?).

With an MoE architecture it should be efficient to backpropagate through the other expert layers so that they align with the output of the selected one, in the end turning multiple experts into multiple Jacks.
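To make the idea concrete, here is a rough toy sketch in plain Python, not anything a real MoE stack does. Everything in it is an assumption for illustration: the two hard-coded "trained" experts, the hard top-1 `router`, the extra `|x|` feature that gives one student enough capacity to cover both regimes, and all function names. Each student starts as a copy of one specialist and is pulled, by plain gradient steps on squared error, toward whatever the selected expert outputs for each input. The teachers stay frozen so the target does not drift.

```python
# Toy sketch: turn each specialist expert into a "Jack" that mimics the whole MoE.
# All names here (router, moe_output, collapse, ...) are hypothetical.

def router(x):
    # Hard top-1 routing: expert 0 handles negative inputs, expert 1 the rest.
    return 0 if x < 0 else 1

def features(x):
    # |x| is added so a single linear student can represent both regimes.
    return (x, abs(x), 1.0)

def predict(w, x):
    return sum(wi * fi for wi, fi in zip(w, features(x)))

# Frozen "trained" experts, written as weights over the same three features.
TEACHERS = [(-1.0, 0.0, 0.0),   # expert 0: y = -x
            (2.0, 0.0, 0.0)]    # expert 1: y = 2x

def moe_output(x):
    # The MoE answer is whatever the selected expert produces.
    return predict(TEACHERS[router(x)], x)

def mean_gap(w, xs):
    # Mean squared disagreement between one model and the full MoE.
    return sum((predict(w, x) - moe_output(x)) ** 2 for x in xs) / len(xs)

def collapse(xs, epochs=200, lr=0.05):
    # Each student starts as a copy of one specialist.
    students = [list(t) for t in TEACHERS]
    for _ in range(epochs):
        for x in xs:
            target = moe_output(x)  # constant: no gradient flows into the teacher
            for w in students:
                err = predict(w, x) - target
                for k, fk in enumerate(features(x)):
                    w[k] -= lr * err * fk  # gradient step on 0.5 * err**2
    return students

XS = [i / 10 for i in range(-20, 21)]
before = [mean_gap(list(t), XS) for t in TEACHERS]
jacks = collapse(XS)
after = [mean_gap(w, XS) for w in jacks]
print("gap before:", before)
print("gap after: ", after)
```

In this toy both students end up reproducing the full routed function on their own, which is the "multiple Jacks" outcome. Whether anything like this stays cheap or stable at real scale, where experts have limited capacity and share a trunk, is an open question.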

Having N Jacks at the end is interesting in itself, as you could then try to exploit the commonalities present across completely different networks that produce the same results.