by ftxbro 1020 days ago
> It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down?

If I remember correctly, it's because the main reason they trained multiple models was to show a scaling trend. Each model was trained using a Chinchilla-optimal mix of model size, data size, and compute. The point was to provide an empirical scaling law that could be extrapolated to estimate the performance of more expensive models. Imagine a billion-dollar model for which the model size, data size, and compute budget are all picked in the Chinchilla-optimal ratios.

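For small models, Chinchilla-optimal scaling stops training while the model is still improving. The recipe minimizes loss for a fixed compute budget, so once that budget is spent, training ends whether or not the loss curve has flattened.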
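To make the arithmetic concrete, here is a rough sketch. It assumes two commonly quoted rules of thumb that are not from the comment above: training compute of about 6 FLOPs per parameter per token, and roughly 20 training tokens per parameter at the Chinchilla optimum. The function names are made up for illustration, and the real fitted scaling law is more nuanced than a single ratio.

```python
import math

# Assumed rules of thumb (approximations, not exact fitted values):
FLOPS_PER_PARAM_TOKEN = 6.0   # training compute C ~= 6 * N * D
TOKENS_PER_PARAM = 20.0       # Chinchilla-optimal D ~= 20 * N


def chinchilla_optimal(compute_flops):
    """Split a compute budget into (params, tokens) under the assumptions above.

    With C = 6 * N * D and D = 20 * N, we get C = 120 * N**2.
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


def optimal_tokens_for_model(n_params):
    """Token count at which a Chinchilla-optimal run of this size would stop."""
    return TOKENS_PER_PARAM * n_params


if __name__ == "__main__":
    for size in (7e9, 13e9, 65e9):
        stop = optimal_tokens_for_model(size)
        print(f"{size / 1e9:.0f}B params: stop after ~{stop / 1e9:.0f}B tokens")
```

Under these assumptions a small model's token budget is modest, and nothing in the recipe checks whether the loss has plateaued by then. That is fine for a data point on a scaling curve, but not for a model you intend to deploy.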

The problem comes when people actually use these small LLaMA models rather than treating them as just data points. If you are actually using one of these models, what you want is one that has been trained on as many tokens, for as long, as possible.