| HN Mirror

> The idea was that a vendor has a number of LLMs fine-tuned for specific tasks, none necessarily better than other, and that they applied heuristics to decide which one to rope in for which queries.

That’s how people keep interpreting it but it’s incorrect. MoE is just a technique to decompose your single giant LLM into smaller models where a random one gets activated for each token. This is great because you need 1/N memory bandwidth to generate a token. Additionally, in the cloud, you split the model parts to different servers to improve utilization and drive down costs.

But the models aren’t actually separated across high level concepts.