by leereeves 832 days ago
More compute also requires more data - scaling equally with model size, according to the Chinchilla paper.

How much more data is available that hasn't already been swept up by AI companies?

And will that data continue to be available as laws change to protect copyright holders from AI companies?
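For a rough sense of what "scaling equally" implies, here is a minimal sketch. It assumes two rules of thumb that are not stated in the comment above: training cost of about 6 FLOPs per parameter per token, and the commonly quoted Chinchilla ratio of about 20 tokens per parameter. The function name and the budgets are made up for illustration.

```python
import math

# Assumptions (not from the thread): C ~= 6 * N * D training FLOPs,
# and a compute-optimal ratio of roughly 20 tokens per parameter.
FLOPS_PER_PARAM_TOKEN = 6
TOKENS_PER_PARAM = 20


def compute_optimal(flops):
    """Split a FLOP budget into (params, tokens) with tokens = 20 * params."""
    # C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Illustrative budgets only.
for budget in (1e21, 1e23, 1e25):
    n, d = compute_optimal(budget)
    print(f"{budget:.0e} FLOPs -> {n:.2e} params, {d:.2e} tokens")
```

Under these assumptions, 100x more compute means roughly 10x more parameters and 10x more tokens, which is why the supply of data becomes the question.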

1 comments

It's not just the volume of original data that matters here. Empirically, performance scales roughly like (model parameters)*(training data)*(epochs). If you increase any one of those, you can expect your model to improve. In the short term, training data volume and quality have delivered a lot of improvements (especially recently), but in the long run it was always model size and total training time that drove the gains. In other words: it doesn't matter how you allocate your extra compute budget, as long as you spend it.
In smaller models, not having enough training data for the model size leads to overfitting. The model predicts the training data better than ever, but generalizes poorly and performs worse on new inputs.

Is there any reason to think the same thing wouldn't happen in billion-parameter LLMs?

This happens in smaller models because you reach parameter saturation very quickly. In modern LLMs with current datasets, it is very hard to even reach that point, because the total compute time amounts to just a handful of epochs (sometimes fewer than one). It would take tremendous resources and time to overtrain GPT-4 the way you would overtrain convnets from the last decade.
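To make "fewer than one epoch" concrete, here is a small sketch of the arithmetic. The helper name and the token counts are hypothetical, chosen only to show the calculation, and do not describe any particular model.

```python
def epochs_seen(tokens_trained, dataset_tokens):
    """Number of passes over the dataset implied by a training run."""
    if dataset_tokens <= 0:
        raise ValueError("dataset_tokens must be positive")
    return tokens_trained / dataset_tokens


# Hypothetical numbers: 1.4e12 tokens processed, 2.0e12 unique tokens available.
print(epochs_seen(1.4e12, 2.0e12))
```

A result below 1 means the run ends before the model has seen every example once, so it has had little chance to memorize the training set.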