I don't think MoE is the way forward. The bottleneck is memory, and MoE trades MORE memory consumption for lower inference times at a given performance level.
Before too long we're going to see architectures where a model decomposes a prompt into a DAG of LLM calls based on expertise, fans out sub-prompts then reconstitutes the answer from the embeddings they return.
Before too long we're going to see architectures where a model decomposes a prompt into a DAG of LLM calls based on expertise, fans out sub-prompts then reconstitutes the answer from the embeddings they return.