Hacker News
by benlivengood 933 days ago
The scaling laws (the original Kaplan paper, Chinchilla, and OpenAI's very opaque scaling graphs for GPT-4) suggest indefinite improvement for the current style of transformers with additional pre-training data and parameters.

No one has hit a model/dataset size where the curves break down, and they're fairly smooth. Simple models that accurately predict performance usually keep working close to the range where they were fitted, so I expect trillion or 10-trillion parameter models to be on the same curve.
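
To make the extrapolation argument concrete, here is a minimal sketch of fitting a power law in log-log space and reading it off at a larger model size. The data points are synthetic (generated from a made-up power law, not real measurements), `fit_power_law` is a hypothetical helper, and the irreducible-loss term that real scaling fits include is left out for simplicity.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A straight line in log-log space is a power law in linear space.
    return math.exp(intercept), -slope

# Synthetic points from an assumed power law; these are NOT real benchmark numbers.
TRUE_A, TRUE_B = 10.0, 0.08
params = [1e8, 1e9, 1e10, 1e11]
losses = [TRUE_A * n ** -TRUE_B for n in params]

a, b = fit_power_law(params, losses)

# Extrapolate two orders of magnitude past the largest fitted point (10T parameters).
predicted = a * 1e13 ** -b
print(f"fitted a={a:.3f}, b={b:.3f}, predicted loss at 1e13 params: {predicted:.3f}")
```

The fit is only as trustworthy as the assumption that the same straight line continues outside the fitted range, which is exactly the point being argued here.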

What we haven't seen yet (that I'm aware of) is whether the specializations to existing models (LoRA, RLHF, different attention methods, etc.) follow similar scaling laws, since most of the effort has been focused on achieving similar performance on smaller/sparser models rather than investing large amounts of money in huge experiments. It will be interesting to see what DeepMind's Gemini reveals.

2 comments

This is the most accurate answer so far re. the scaling laws. It has been demonstrated that LLMs follow quite clear power laws with respect to performance. In fact, the performance of a model can be predicted quite well from the number of parameters it has and the amount of data it is trained on. The Wikipedia article on neural scaling laws provides a brief, accessible summary of this. Both data and parameters are expected to increase in coming years, so models are expected to improve.
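
As a rough illustration of predicting loss from parameter and data counts, here is a sketch of the parametric form used in the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta. The default constants are the fitted values as I recall them from that paper, so treat them as illustrative rather than authoritative; `chinchilla_loss` is a hypothetical helper name.

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Estimated pre-training loss for a model with n_params parameters
    trained on n_tokens tokens.

    E is the irreducible loss; the other two terms shrink as the model
    and the dataset grow. Constants are assumed, quoted from memory of
    the Chinchilla fit.
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Roughly Chinchilla-sized: 70B parameters, 1.4T tokens.
base = chinchilla_loss(70e9, 1.4e12)

# Scaling both by 10x lowers the estimate but never below E.
bigger = chinchilla_loss(700e9, 14e12)

print(f"base: {base:.3f}, 10x params and data: {bigger:.3f}")
```

Note that this only predicts pre-training loss, not performance on any particular downstream task.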
> No one has hit a model/dataset size where the curves break down, and they're fairly smooth.

The same was true of transistors, until it wasn't and they started diverging from the predictions about how they would behave when very small. That happened sometime around the late NetBurst era: the Pentium 4/NetBurst architecture was sunk by this problem. Its designers assumed it would scale to 8-10 GHz on a sane power budget, and it simply didn't, as the "improvement per transistor shrink" became less and less.