by minimaxir
902 days ago
It was fun to follow the public TinyLlama loss curves in near real-time, although it could be frustrating, since the loss curves barely moved down even after an extra trillion tokens: https://wandb.ai/lance777/lightning_logs/reports/metric-trai... (note the log-scaled X-axis). But they did move down, and that's what's important. There should probably be more aggressive learning rate annealing for models trying to be Chinchilla-optimal, instead of just cosine-with-warmup like every other model nowadays.
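For anyone unfamiliar with the "cosine-with-warmup" schedule mentioned above, here is a minimal sketch of what it usually looks like. The function name, the default `min_lr`, and the numbers in the usage lines are illustrative assumptions, not TinyLlama's actual settings.

```python
import math

def cosine_with_warmup(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Hypothetical helper: learning rate at a given optimizer step.

    Linear warmup to peak_lr over warmup_steps, then cosine decay
    from peak_lr down to min_lr by total_steps.
    """
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Illustrative values only (assumed, not taken from any real run).
for s in (0, 99, 550, 1000):
    print(s, cosine_with_warmup(s, total_steps=1000, warmup_steps=100, peak_lr=4e-4))
```

"More aggressive annealing" would amount to changing the decay branch, for example by raising the cosine term to a power or decaying to a lower `min_lr` earlier; which variant helps is an empirical question.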
My current understanding of the story, to recap:
- First, the game was to increase model size massively
- For example, GPT-3 had 175B parameters but was trained on less than 0.5T tokens
- Then Chinchilla showed that, for a given compute budget, we can scale better by increasing training data
- Now we have models like this one, and Phi, that are trained on over 1T tokens
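To put rough numbers on the recap above, here is a back-of-the-envelope sketch. It assumes two common rules of thumb that are not stated in this thread: training compute of roughly 6 FLOPs per parameter per token, and the roughly 20 tokens per parameter often quoted as the Chinchilla-optimal ratio. The 300B token count for GPT-3 is my assumption for "less than 0.5T", and the function names are made up for illustration.

```python
def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def approx_train_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

def chinchilla_tokens(n_params, ratio=20):
    """Token budget under an assumed ~20 tokens-per-parameter heuristic."""
    return ratio * n_params

# GPT-3-sized example: 175B parameters, assumed 300B training tokens.
n_params, n_tokens = 175e9, 300e9
print(f"tokens per parameter: {tokens_per_param(n_params, n_tokens):.2f}")
print(f"approx training FLOPs: {approx_train_flops(n_params, n_tokens):.2e}")
print(f"heuristic token budget: {chinchilla_tokens(n_params):.2e}")
```

Under these assumptions a 175B model sees fewer than 2 tokens per parameter, while a small model trained on over 1T tokens sits far on the other side of the 20:1 heuristic.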
For any model, the training loss going down could mean it's learning or could mean it's overfitting. We can't tell which without looking at validation loss, which is the loss on a held-out set of data the model hasn't seen during training.
So, getting back to your comment: I thought there was actually a multitude of indicators that should be used, not just validation loss, to determine what would be gained with more training.
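As a rough illustration of the train-versus-validation distinction, here is a minimal sketch that compares the recent trend of both curves. The function name, the window size, and the tolerance are arbitrary assumptions, and a real decision about further training would look at more than this one signal.

```python
def diagnose(train_loss, val_loss, window=3, tol=1e-3):
    """Hypothetical check: compare the mean of the last `window` points
    with the mean of the `window` points before them, for both curves."""
    if len(train_loss) != len(val_loss) or len(train_loss) < 2 * window:
        raise ValueError("need two equal-length curves with at least 2 * window points")

    def mean(xs):
        return sum(xs) / len(xs)

    train_delta = mean(train_loss[-window:]) - mean(train_loss[-2 * window:-window])
    val_delta = mean(val_loss[-window:]) - mean(val_loss[-2 * window:-window])

    if train_delta < -tol and val_delta > tol:
        # Training loss still falling while held-out loss rises.
        return "overfitting"
    if val_delta < -tol:
        # Held-out loss is still improving.
        return "still learning"
    return "plateau"

# Made-up curves for illustration only.
train = [3.0, 2.8, 2.6, 2.4, 2.2, 2.0]
print(diagnose(train, [3.1, 2.9, 2.7, 2.5, 2.3, 2.1]))
print(diagnose(train, [3.0, 2.8, 2.7, 2.8, 2.9, 3.0]))
```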