| HN Mirror

A huge misconception is that MoE is an ensemble of discrete models, when it is in fact multiple FFNN modules that share an attention and embedding module.

Basically the idea is that there's some pars of the model (attention/embedding) that should be trained on everything and used in every inference and other parts (the FFNN) that are fine to specialize on certain types of data (via a routing module that is also trained).

[0] https://arxiv.org/pdf/1701.06538.pdf [1] https://arxiv.org/pdf/2112.06905.pdf

EDIT: Specifically GLaM model architecture. Each MoE layer (the bottom block) is interleaved with a Transformer layer (the upper block). For each input token, e.g., ‘roses’, the Gating module dynamically selects two most relevant experts out of 64, which is represented by the blue grid in the MoE layer. The weighted average of the outputs from these two experts will then be passed to the upper Transformer layer. For the next token in the input sequence, two different experts will be selected.