by benlivengood
933 days ago
The scaling laws (the original Kaplan paper, Chinchilla, and OpenAI's very opaque scaling graphs for GPT-4) suggest indefinite improvement for the current style of transformers given additional pre-training data and parameters. No one has hit a model or dataset size where the curves break down, and they're fairly smooth. Simple models that accurately predict performance usually keep working reasonably well close to the range they were fit on, so I expect trillion- or 10-trillion-parameter models to land on the same curve. What we haven't seen yet (that I'm aware of) is whether modifications to existing models (LoRA, RLHF, different attention methods, etc.) follow similar scaling laws, since most of that effort has focused on matching performance with smaller or sparser models rather than spending large amounts of money on huge experiments. It will be interesting to see what DeepMind's Gemini reveals.
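To make the "same curve" extrapolation concrete, here is a rough sketch using the parametric loss form from the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta. The constants are the approximate fitted values reported there as I recall them, the 20-tokens-per-parameter ratio is only the usual rule of thumb, and the function name is mine. Treat the outputs as illustrative, not as predictions for any real model.

```python
def predicted_loss(n_params, n_tokens,
                   E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style parametric loss: irreducible term plus two power laws.

    Constants are approximate values from the Chinchilla paper's parametric
    fit (an assumption here; check the paper before relying on them).
    """
    return E + A / n_params ** alpha + B / n_tokens ** beta


if __name__ == "__main__":
    # Extrapolate from a ~70B model to 1T and 10T parameters.
    for n_params in (7e10, 1e12, 1e13):
        n_tokens = 20 * n_params  # assumed rule-of-thumb data budget
        loss = predicted_loss(n_params, n_tokens)
        print(f"N={n_params:.0e}  D={n_tokens:.0e}  predicted loss={loss:.3f}")
```

The curve keeps dropping but flattens toward the irreducible term E, which is what "indefinite improvement" looks like under this fit. It says nothing about whether fine-tuning methods follow a similar law.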