| You are correct. DeepMind released a paper earlier this year showing that data, not model size, is the primary constraint holding back these models (i.e., a model with 5 billion parameters is not much better than one with 1 billion, but more data can make both much better) [0]. I will copy-paste the main findings from the article here: - Data, not size, is the currently active constraint on language modeling performance. Current returns to additional data are immense, and current returns to additional model size are minuscule; indeed, most recent landmark models are wastefully big. - If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models. - If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would be a great loss relative to what would otherwise be possible. - The literature is extremely unclear on how much text data is actually available for training. We may be "running out" of general-domain data, but the literature is too vague to know one way or the other. - The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available. [0] https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinc... |