by oofbaroomf
415 days ago
First off, they are basically completely different technologies, so it would be disingenuous to act like it's an apples-to-apples comparison. But a simple way to see it is that when you pick between multiple large models that have different strengths, you simply have a larger number of parameters to work with (e.g. DeepSeek R1 + V3 + Qwen + LLaMA ends up being about 2 trillion total parameters to pick from), whereas "picking" the experts inside a single MoE draws on a smaller pool of distinct parameters (e.g. R1 is 671 billion, Qwen is 235 billion).
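
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The R1 and Qwen counts are the ones quoted above. The V3 and LLaMA counts are my assumptions (671B for V3, and the 405B LLaMA variant), since the comment doesn't say which sizes it has in mind, and the function names are just made up for illustration.

```python
# Total parameter counts, in billions.
# R1 and Qwen figures come from the comment above.
# V3 and LLaMA figures are assumptions about which models were meant.
MODEL_PARAMS_B = {
    "DeepSeek R1": 671,
    "DeepSeek V3": 671,  # assumption: same total size as R1
    "Qwen": 235,
    "LLaMA": 405,        # assumption: the 405B variant
}


def routed_pool_b(models):
    """Parameters available when routing between several whole models."""
    return sum(models.values())


def largest_single_moe_b(models):
    """Parameters available when routing among experts inside one model."""
    return max(models.values())


if __name__ == "__main__":
    pool = routed_pool_b(MODEL_PARAMS_B)
    single = largest_single_moe_b(MODEL_PARAMS_B)
    print(f"routing across models: {pool}B total parameters")
    print(f"largest single MoE:    {single}B total parameters")
    print(f"ratio: {pool / single:.1f}x")
```

Under those assumptions the multi-model pool is a bit under 2 trillion parameters, about three times the largest single model in the set. This only counts total parameters; it says nothing about how many are active per token or how good the routing is.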