Hacker News new | ask | show | jobs
by tppiotrowski 783 days ago
I took the Andrew Ng Coursera machine learning course in 2015 and to this day I still remember him saying this in one of the videos. At the time he was talking about various versions/optimizations of gradient descent but he essentially said that tweaking the algorithm will only make your model ~1% better while doubling the amount of training data will have a substantially larger impact (use any old algorithm but just throw more data at the problem). That's why it was already evident back then that Google, Facebook, etc were sitting on a goldmine because in the long run those with the most data, not the brightest PhDs will win this race.
2 comments

There’s some enormous caveats to this.

The model architecture is 100% the thing that makes LLMs special. You would not get this doing token prediction with word2vec.

The model sizes are also hugely important. Adding billions of parameters does introduce the capability to fit to new features.

The models eventually reach saturation of how much they can fit to. There’s reason to believe that current LLMs are underfit to what their sizes could theoretically utilize, but it could also be that the optimization algorithms are simply not capable of easily and efficiently utilizing another 2x data to fill out the space. Doubling the model size, on the same training data, and letting it be even more underfit could result in a better model.

> That's why it was already evident back then that Google, Facebook, etc were sitting on a goldmine because in the long run those with the most data, not the brightest PhDs will win this race.

So far it doesn't seem to be panning out that way though. Companies such as OpenAI, Anthropic and Reka don't have any special internal sources of data, yet all have trained SOTA models.

Probably the main reason for this is that data type/quality matters more than quantity, which is why most of these companies are now using self-generated synthetic data.

The companies/institutes that will have a data advantage are those that have private datasets consisting of a different type (or maybe higher quality?) of data than publicly available, but this seems more likely to be in specialized domains (medical, etc), rather than what is useful for general intelligence.

I assume that, longer term, we'll have better AI architectures capable of realtime learning, and then the focus may switch on-the-job training and learning ability, rather than data.