by yorwba
98 days ago
Anthropic is obviously also aware of the benefits of MoE and distilling a larger model into a smaller one, so they could run a model of the same size as Alibaba's for the same inference cost if they want to. Or they can run a slightly larger model for slightly higher cost. They definitely aren't running a much larger model (except potentially as a teacher for distillation training) because then they wouldn't be able to hit the output speeds they're hitting.
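
To put rough numbers on the speed argument, here is a minimal sketch, assuming single-stream decoding is memory-bandwidth bound and every active weight is read once per generated token. The function name and all the figures are hypothetical, picked only for illustration, and the bound ignores KV-cache traffic, batching, and speculative decoding.

```python
def max_active_params(bandwidth_bytes_per_s: float,
                      tokens_per_s: float,
                      bytes_per_param: float) -> float:
    """Upper bound on active parameters per token for bandwidth-bound decoding.

    Assumes each active weight is read from memory once per generated token.
    For an MoE model this limits the active parameters, not the total.
    """
    return bandwidth_bytes_per_s / (tokens_per_s * bytes_per_param)


# Illustrative numbers only, not measurements of any real deployment.
bound = max_active_params(
    bandwidth_bytes_per_s=8e12,  # hypothetical aggregate memory bandwidth
    tokens_per_s=100,            # hypothetical per-stream output speed
    bytes_per_param=1,           # hypothetical 8-bit weights
)
print(f"{bound / 1e9:.0f}B active parameters at most")  # prints "80B active parameters at most"
```

Doubling the target output speed halves the bound, which is why observed output speed says something about how large the active part of the served model can be.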
Chinese models were built under constraints, and as we know, limitations lead to innovation. So the "Chinese" R&D invested in optimisation. Teacher models already existed, so they likely built the best distillation processes, along with the best MoE implementations. In fact, they have published much of this work.
Nuance, sure: Anthropic and OpenAI could revise their philosophy to prioritise efficiency.
But momentum shouldn't be underestimated. Plus, dollars per optimisation is a different calculation altogether; it's not only about access to the latest Nvidia GPUs. At $400k per engineer per year, plus health coverage and pension contributions, hardware efficiency doesn't weigh as much as making sure engineering focuses on... the raw power factor, I suppose.