by jofi1 1024 days ago

> It is something I have been wondering about: why did Meta not keep the training process going while the loss curves still seemed to be going down? Could they conceivably release a Llama 2.1 made from checkpoints taken a month after 2.0 was 'cut'? Maybe the expected gain is too small compared to what can be gained with fine/instruct tuning afterward anyway?

Because choosing the LR decay schedule requires knowing the number of steps in advance. By the end of the 2T tokens the LR is too small, and changing it afterwards doesn't tend to help.
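
To illustrate why the step count has to be fixed up front, here is a minimal sketch of a linear-warmup-plus-cosine-decay schedule. The function name `cosine_lr` and every default value (`peak_lr`, `min_lr`, `warmup_steps`) are illustrative assumptions, not numbers from this thread or from Meta's training setup.

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=2000):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps.

    All defaults are hypothetical values chosen only for illustration.
    """
    if step < warmup_steps:
        # Linear ramp from near zero up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed; clamped so the LR stays at
    # min_lr if training runs past the planned total_steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# The LR at a given step depends on the planned budget:
short_run = cosine_lr(50_000, total_steps=100_000)
long_run = cosine_lr(50_000, total_steps=200_000)
```

In this sketch `long_run` is larger than `short_run`, because the longer budget is less far along its decay at the same step. Once `step` reaches `total_steps` the rate sits at `min_lr`, which is the "LR is too small" situation described above.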

https://twitter.com/sherjilozair/status/1687837844729966592

jph00 1023 days ago

AFAIK re-warming it up and then gradually decreasing it again ought to work fine. Have you seen any research showing that it doesn't?
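
A minimal sketch of what re-warming could look like when training resumes from a finished checkpoint: ramp the LR linearly from where the old schedule ended up to a new peak, then decay it again over the extra steps. The name `rewarmed_lr` and all default values are hypothetical, and whether this recovers the quality of an uninterrupted run is exactly the open question in this thread.

```python
import math

def rewarmed_lr(extra_step, extra_steps, start_lr=3e-5, peak_lr=1e-4,
                final_lr=1e-5, rewarm_steps=1000):
    """LR for continued training, indexed from the resume point.

    Ramps linearly from start_lr (where the old schedule ended) to peak_lr,
    then cosine-decays to final_lr. All defaults are hypothetical.
    """
    if extra_step < rewarm_steps:
        # Re-warm phase: linear ramp up to the new peak.
        return start_lr + (peak_lr - start_lr) * extra_step / rewarm_steps
    # Second decay phase over the remaining extra steps.
    progress = min(1.0, (extra_step - rewarm_steps) / max(1, extra_steps - rewarm_steps))
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```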

fpgaminer 1023 days ago

That would work, in that it would allow one to continue decreasing the loss, but I wouldn't say that it would work "fine". A model trained with restarts always performs worse than a model trained for the same duration without restarts.

two_in_one 1023 days ago

> A model trained with restarts always performs worse than a model trained for the same duration without restarts.

A citation would be nice. In my experience a restart is sometimes required: when the model becomes unstable and 'explodes', or gets stuck in some local minimum. This is common with GANs. I usually roll the model back a bit but keep the latest discriminator, so that the discriminator 'knows' what to expect. It works in most cases, except for the 'fatality', when the model blows up no matter what. That's the end of training.

jph00 1023 days ago

I haven't seen any research that supports your contention. SGDR (SGD with warm restarts) has been shown to work well. https://arxiv.org/abs/1608.03983
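
For reference, a minimal sketch of the cosine-annealing-with-warm-restarts schedule described in that paper: within each cycle the LR follows half a cosine from `lr_max` down to `lr_min`, then jumps back to `lr_max`, and each cycle is `t_mult` times longer than the previous one. The function name `sgdr_lr` and the default values are illustrative assumptions.

```python
import math

def sgdr_lr(step, first_cycle_steps, t_mult=2, lr_max=0.1, lr_min=0.0):
    """Cosine annealing with warm restarts.

    t_mult is assumed to be an integer >= 1; defaults are hypothetical.
    """
    t_cur, t_i = step, first_cycle_steps
    # Walk forward through completed cycles to find the current one.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    # Position within the current cycle sets the cosine phase.
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / t_i))

# With first_cycle_steps=10 and t_mult=2, restarts fall at steps 10, 30, 70, ...
schedule = [sgdr_lr(s, first_cycle_steps=10) for s in range(70)]
```

PyTorch ships a scheduler for the same idea, `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, which is probably the better choice in a real training loop.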

charcircuit 1024 days ago

You could manually increase the learning rate or change the decay at any time.
