by duchenne 913 days ago
The most important paper to understand this issue is "Scaling Laws for Neural Language Models" by OpenAI in 2020 [1]. Many consider it the key paper that predicted the high performance of modern LLMs.

This paper shows how the loss decreases when you increase the model size, compute, or training dataset size.

From the article:

> Convergence is inefficient: When working within a fixed compute budget C but without any other restrictions on the model size N or available data D, we attain optimal performance by training very large models and stopping significantly short of convergence.

It clearly states that when you are limited by training-time compute, you should train a larger model and stop before it converges, i.e. under-train it.

[1] https://arxiv.org/abs/2001.08361
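
To make the "loss decreases as you scale" point concrete, here is a minimal sketch of the pure power-law form the paper fits, L(x) = (x_c / x)^alpha, where x can be model size, dataset size, or compute. The function name and the constants `X_C` and `ALPHA` are placeholders I picked for illustration, not the paper's fitted values.

```python
def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """Loss predicted by a pure power law L(x) = (x_c / x) ** alpha."""
    if x <= 0:
        raise ValueError("x must be positive")
    return (x_c / x) ** alpha


# Placeholder constants for illustration only (NOT the paper's fitted values).
X_C = 1e14
ALPHA = 0.08

if __name__ == "__main__":
    # Each 10x increase in x multiplies the loss by the same factor, 10 ** -ALPHA.
    for n in (1e8, 1e9, 1e10):
        print(f"x = {n:.0e}  predicted loss = {power_law_loss(n, X_C, ALPHA):.3f}")
```

The takeaway is that a power law gives a constant multiplicative improvement per order of magnitude, which is why the curves look like straight lines on a log-log plot.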

1 comments

that paper is now considered to be a psyop fwiw - but in the direction of too little data, not too many layers
Can you clarify what you mean?
Because the training data/model size/compute tradeoff derived from that paper is highly suboptimal (too many parameters) compared to the ones from the later DeepMind scaling laws [1]. And then Meta researchers recommended using even smaller models, to trade off training-time against inference-time compute [2] (which I thought was pretty obvious if you care about more than just benchmarks).

[1] https://arxiv.org/abs/2203.15556 Training Compute-Optimal Large Language Models

[2] https://arxiv.org/abs/2302.13971 LLaMA: Open and Efficient Foundation Language Models
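
A rough sketch of what the tradeoff looks like in numbers, under two assumptions that are not stated in this thread: training compute is approximated as C ≈ 6·N·D FLOPs (N parameters, D tokens), and the compute-optimal ratio is taken as roughly 20 tokens per parameter, a rule of thumb commonly attributed to [1]. The function name and defaults are mine.

```python
import math


def compute_optimal_split(
    compute_flops: float,
    tokens_per_param: float = 20.0,       # assumed rule-of-thumb ratio
    flops_per_param_token: float = 6.0,   # assumed C ~= 6 * N * D approximation
) -> tuple[float, float]:
    """Return (n_params, n_tokens) that spend compute_flops at the given ratio."""
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    # Substitute D = r * N into C = k * N * D and solve for N: N = sqrt(C / (k * r)).
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    n, d = compute_optimal_split(5.88e23)
    print(f"params = {n:.2e}, tokens = {d:.2e}")
```

With a budget of 5.88e23 FLOPs this gives about 70B parameters and 1.4T tokens. Lowering `tokens_per_param` moves the split toward bigger models trained on less data, the direction criticized above; raising it gives the smaller, longer-trained models argued for in [2].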

He seems to be implying that OpenAI released that paper to throw others off the scent of the direction they were taking.