

by hirako2000 97 days ago

They are fully aware, but they are playing a different game. R&D isn't something where you flip a parameter and get what the efficiency-oriented pipelines do.

Chinese models were built under constraints. As we know, limitations lead to innovation, so the "Chinese" R&D invested in optimisations. Teacher models were already there, so they likely built the best distillation processes, along with the best MoE. In fact, they published much of this work.

Nuance, sure. Anthropic/OpenAI could revise their philosophy to adopt efficiency.

But momentum shouldn't be underestimated. Plus, dollars per optimisation is a different math altogether; it's not only about access to the latest Nvidia GPUs. At $400k per engineer per year, plus health coverage and pension contributions, hardware efficiency doesn't weigh as much as making sure engineering focuses on... the raw power factor, I suppose.


yorwba 97 days ago

Every company is subject to constraints. A bigger budget is not an infinite budget. And there is no tradeoff between efficiency and raw power. An optimization that lets you build a similarly powerful model for less money also lets you build a more powerful model for the same amount of money.

Honestly, I wonder what you think closed LLM companies do R&D on if not optimizations. And the nature of research is that most ideas that sound good turn out to be duds, so they already need to have an established process for testing many ideas quickly. Now if somebody publishes a new idea they haven't tried yet, setting up an experiment to try it out is just a routine task... But they aren't going to tell anybody the results, just quietly integrate it if it works.

hirako2000 97 days ago

I concede we can't be sure what they do since it's proprietary, aside from leaks which give us a sense of the philosophy.

It's clear to me the economics would make the likes of OpenAI and Anthropic focus on raw power over optimisations. I never meant they wouldn't optimise anything, but they hit diminishing returns earlier than a company like Alibaba, or even Mistral.

The Chinese models were trained in the context of compute scarcity. So for them it isn't the same as "routine" optimisations; it's optimisations or nothing.

A year or two later those optimisations allowed their models to be somewhat on par with raw power models from the US providers.

Now, despite papers being published, a design is rather sticky; it's not as simple as plugging in an optimisation another lab came up with. It depends on the optimisation: perhaps multi-head wasn't that big of a deal to add in, while MoE would have been less easy.