| HN Mirror

The original MoE research done by Google around LLMs involved nested transformers to scale them. It was a layered approach where at each layer you would have set of experts, generally routed to by simple heuristics, then each of those models would call into its own series of experts and combine the data in various ways.

These models were SOTA for their time