Hacker News
by satvikpendem 1284 days ago
You are correct. DeepMind released a paper earlier this year showing that data, not model size, is the primary constraint holding back these models (i.e. a model with 5 billion parameters is not much better than one with 1 billion, but more data can make both much better) [0].

I will copy paste the main findings from the article here:

- Data, not size, is the currently active constraint on language modeling performance. Current returns to additional data are immense, and current returns to additional model size are minuscule; indeed, most recent landmark models are wastefully big.

- If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models.

- If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would be a great loss relative to what would otherwise be possible.

- The literature is extremely unclear on how much text data is actually available for training. We may be "running out" of general-domain data, but the literature is too vague to know one way or the other.

- The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available.
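
To make the first bullet concrete, here is a rough sketch of the parametric loss fit from the DeepMind paper, L(N, D) = E + A/N^alpha + B/D^beta, where N is the parameter count and D is the number of training tokens. The constants are the fitted values as I recall them from the paper, and the two configurations (a Gopher-like 280B model on 300B tokens, a Chinchilla-like 70B model on 1.4T tokens) are illustrative assumptions, so treat the output as a ballpark rather than a reproduction.

```python
# Sketch of the parametric loss fit L(N, D) = E + A / N**ALPHA + B / D**BETA.
# Assumption: constants are the fitted values reported in the paper, quoted
# from memory; check them against the paper before relying on the numbers.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def param_term(n_params: float) -> float:
    """Loss attributable to finite model size."""
    return A / n_params ** ALPHA


def data_term(n_tokens: float) -> float:
    """Loss attributable to finite training data."""
    return B / n_tokens ** BETA


def loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params trained on n_tokens."""
    return E + param_term(n_params) + data_term(n_tokens)


# Illustrative (parameters, tokens) pairs; the labels are assumptions.
configs = {
    "Gopher-like (280B params, 300B tokens)": (280e9, 300e9),
    "Chinchilla-like (70B params, 1.4T tokens)": (70e9, 1.4e12),
}

for name, (n, d) in configs.items():
    print(f"{name}: size term {param_term(n):.3f}, "
          f"data term {data_term(d):.3f}, total {loss(n, d):.3f}")
```

Under these assumptions the data term dominates the size term for the larger model, and the smaller model trained on more tokens comes out with the lower predicted loss, which is the sense in which the big models are "wastefully big".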

[0] https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinc...

2 comments

I wonder how that relates to your earlier comment in the thread, and whether the impact of dataset quality on performance has been studied.
I'm not an ML engineer (anymore), so I don't know the particulars, but I'd say that while the amount of data matters, it's still better to have high-quality data than to not have it.
This post is about image generation, not language models.
I'd imagine the situation is the same for image generation models too.