by ouraf
1057 days ago
Isn't that more or less how GPT-4 works? Multiple "expert" LLMs giving input depending on the context? [0]
[0] https://the-decoder.com/gpt-4-architecture-datasets-costs-an...
The biggest issue is what happens if you have too many specialists, spin up a lot of them to reply to the same query, and then discard the less optimal answers. Your answer quality might improve, but the computing costs could skyrocket without some smart filtering and distribution before the query reaches any LLM.
Basically the idea is that there are some parts of the model (attention/embedding) that should be trained on everything and used in every inference, and other parts (the FFNN) that are fine to specialize on certain types of data (via a routing module that is also trained).
[0] https://arxiv.org/pdf/1701.06538.pdf
[1] https://arxiv.org/pdf/2112.06905.pdf
EDIT: Specifically, the GLaM model architecture, as described in [1]: "Each MoE layer (the bottom block) is interleaved with a Transformer layer (the upper block). For each input token, e.g., ‘roses’, the Gating module dynamically selects two most relevant experts out of 64, which is represented by the blue grid in the MoE layer. The weighted average of the outputs from these two experts will then be passed to the upper Transformer layer. For the next token in the input sequence, two different experts will be selected."
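To make the routing concrete, here is a rough sketch of top-2 gating for a single token, in plain Python. It is not GLaM's actual code. The names (`top2_moe`, `gate_weights`, `experts`) are made up for illustration, each "expert" is reduced to one linear map rather than a full feed-forward block, and the weights are random. The point is only that the gate scores all 64 experts but just two of them run per token.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def top2_moe(token, gate_weights, experts):
    """Route one token vector through its two highest-scoring experts.

    Hypothetical helper: `gate_weights` has one row per expert, and each
    entry of `experts` is a square matrix standing in for an expert FFN.
    """
    scores = matvec(gate_weights, token)  # one gating logit per expert
    top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
    # Softmax over just the two selected logits, so the mixing weights sum to 1.
    weights = softmax([scores[i] for i in top2])
    # Only the two selected experts are evaluated; the rest are skipped.
    outputs = [matvec(experts[i], token) for i in top2]
    mixed = [
        sum(w * out[d] for w, out in zip(weights, outputs))
        for d in range(len(outputs[0]))
    ]
    return mixed, top2, weights


def rand_matrix(rows, cols):
    """Random Gaussian matrix, used here only as placeholder parameters."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
DIM, NUM_EXPERTS = 4, 64  # 64 experts as in the quoted description; DIM is arbitrary
gate_weights = rand_matrix(NUM_EXPERTS, DIM)
experts = [rand_matrix(DIM, DIM) for _ in range(NUM_EXPERTS)]
token = [random.gauss(0.0, 1.0) for _ in range(DIM)]

mixed, chosen, weights = top2_moe(token, gate_weights, experts)
print("experts used:", chosen, "mixing weights:", weights)
```

Per token, the cost is the gate plus two expert evaluations, regardless of how many experts exist in total. That is the difference from running every specialist and throwing most of the answers away.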