Hacker News
by imjonse 1016 days ago
From the FAQ:

' Why would pretraining a 1.1B model for so long make sense? Doesn't it contradict the Chinchilla Scaling Law?

Above is the training loss curve taken from the Llama 2 paper. Here I quote from that paper: "We observe that after pretraining on 2T Tokens, the models still did not show any sign of saturation". That is why we believe pretraining a 1.1B model for 3T tokens is a reasonable thing to do. Even if the loss curve does not go down eventually, we can still study the phenomenon of saturation and learn something from it.'

It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down? Could they conceivably release a Llama 2.1 being checkpoints taken a month after 2.0 was 'cut'? Maybe the expected gain is too small compared to what can be gained with fine/instruct tuning afterward anyway?

3 comments

> It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down? Could they conceivably release a Llama 2.1 being checkpoints taken a month after 2.0 was 'cut'? Maybe the expected gain is too small compared to what can be gained with fine/instruct tuning afterward anyway?

Because choosing the LR decay requires knowing the # of steps in advance. LR is too small after the 2T tokens, and changing it afterwards doesn't tend to help.

https://twitter.com/sherjilozair/status/1687837844729966592
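
To make the point about the step count concrete, here is a minimal sketch of a linear-warmup plus cosine-decay schedule. The function name `cosine_lr` and every number in it (peak rate, floor, warmup length, step budget) are illustrative assumptions, not Meta's actual configuration. What it shows is that `total_steps` is an input to the schedule, so by the end of the budget the learning rate has already reached its floor.

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000):
    """Linear warmup, then cosine decay from peak_lr down to min_lr at total_steps.

    All default values are hypothetical and chosen only for illustration.
    """
    if step < warmup_steps:
        # Ramp up linearly so the first updates are small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped so that any step past
    # the planned budget stays at the floor.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    total = 500_000  # hypothetical step budget, fixed before training starts
    for s in (0, 2_000, 250_000, 499_999, 600_000):
        print(s, cosine_lr(s, total))
```

Simply running past `total` keeps training at the floor rate, which is the "LR is too small" situation described above. Doing anything else means choosing a new schedule.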

AFAIK re-warming it up and then gradually decreasing it again ought to work fine. Have you seen any research showing that it doesn't?

That would work, in that it would allow one to continue decreasing the loss, but I wouldn't say that it would work "fine". A model trained with restarts always performs worse than a model trained for the same duration without restarts.

> A model trained with restarts always performs worse than a model trained for the same duration without restarts.

A citation would be nice. In my experience a restart is sometimes required, for example when the model becomes unstable and 'explodes', or gets stuck in a local minimum. This is common with GANs. I usually roll the model back a bit but keep the latest discriminator, so that the discriminator 'knows' what to expect. It works in most cases, except for the 'fatality', when the model blows up no matter what. That's the end of training.

I haven't seen any research that supports your contention. SGDR (SGD with warm restarts) has been shown to work well. https://arxiv.org/abs/1608.03983
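
For readers who have not seen a warm-restart schedule, here is a minimal sketch of the cosine annealing rule from the SGDR paper linked above: within each cycle the rate follows eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_i)), and it jumps back to eta_max when a cycle ends. The function name `sgdr_lr` and the default values are assumptions for illustration only.

```python
import math

def sgdr_lr(step, cycle_len=1000, t_mult=2, eta_max=1e-3, eta_min=1e-5):
    """Cosine annealing with warm restarts.

    cycle_len is the length of the first cycle; each later cycle is t_mult
    times longer. All default values are hypothetical.
    """
    t_i = cycle_len   # length of the current cycle
    t_cur = step      # steps elapsed since the last restart
    # Walk forward through completed cycles to find the one containing `step`.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))

if __name__ == "__main__":
    # The rate decays within a cycle, then jumps back to eta_max at steps 1000 and 3000.
    for s in (0, 500, 999, 1000, 2000, 2999, 3000):
        print(s, sgdr_lr(s))
```

PyTorch ships a scheduler of this kind as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`. Whether a restarted run matches an uninterrupted one at equal compute is the point being argued in this thread, and the sketch does not settle it.
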

You could manually increase the learning rate or change the decay at any time.

The most plausible explanation I've seen (other than the Carmack 'sudden grokking' beyond the cutoff idea) is that they're planning to release Llama 3 sooner rather than later, with some architecture changes for even better performance, so it makes sense to dedicate resources there instead.

> It is something I have been wondering about: why did Meta not keep the training process going on while the loss curves seemed to go down?

If I remember correctly, it's because the main reason they trained multiple models was to show a scaling trend. Each model was trained using a Chinchilla-optimal mix of model size, data size, and compute. The point was to provide an empirical scaling law that could be extrapolated to estimate the performance of more expensive models: imagine a billion-dollar model whose model size, data size, and compute are picked in the Chinchilla-optimal ratios.

On small models, Chinchilla-optimal scaling stops training even while the model is still improving.

The problem comes when people actually use these small Llama models rather than treating them as just data points. If you are actually using a model, what you want is one trained on as many tokens, and for as long, as possible.
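
To put rough numbers on the gap between compute-optimal training and the 1.1B / 3T-token run quoted from the FAQ, here is a back-of-the-envelope sketch. It relies on two widely used approximations that are assumptions here, not figures from this thread: about 20 training tokens per parameter as the Chinchilla-optimal ratio, and about 6 * N * D FLOPs to train N parameters on D tokens. The function names are hypothetical.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb Chinchilla ratio (an assumption, not an exact constant)

def chinchilla_tokens(n_params):
    """Approximate compute-optimal token count for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def train_flops(n_params, n_tokens):
    """Approximate training cost using the common 6 * N * D estimate."""
    return 6 * n_params * n_tokens

if __name__ == "__main__":
    n_params = 1.1e9        # model size quoted in the FAQ
    planned_tokens = 3e12   # token budget quoted in the FAQ

    optimal_tokens = chinchilla_tokens(n_params)
    print(f"compute-optimal tokens: {optimal_tokens:.3g}")
    print(f"planned tokens:         {planned_tokens:.3g}")
    print(f"overtraining factor:    {planned_tokens / optimal_tokens:.0f}x")
    print(f"planned training FLOPs: {train_flops(n_params, planned_tokens):.3g}")
```

Under these assumptions the planned run uses on the order of a hundred times more tokens than the compute-optimal count. That is the trade-off described above: compute-optimal minimizes training cost for a given loss, while a model meant to be deployed can justify far more tokens.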