by ssivark 902 days ago
How crucial is it to freeze the learning rate schedule a priori, instead of tweaking it on the fly?
2 comments

Constant learning rates were the default in older ML implementations. Linear decay became an obvious improvement, and warmup followed by cosine decay is now the common pattern, especially with the AdamW optimizer.
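For concreteness, here is a minimal sketch of a warmup-plus-cosine schedule in plain Python. The function name `lr_at_step` and its parameters are made up for illustration, and the linear warmup shape is an assumption; real training code varies in the details.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Hypothetical helper: linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup: ramp up so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay from peak_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: print the learning rate at a few points of a 1000-step run.
for s in (0, 50, 100, 550, 1000):
    print(s, lr_at_step(s, total_steps=1000, warmup_steps=100, peak_lr=1e-3))
```

Note that the whole curve is fixed once `total_steps` is chosen, which is exactly the "frozen a priori" property the question is asking about.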

If the learning rate is too high at a given point in training, the model can either a) stop learning, with the loss plateauing or oscillating, or b) hit exploding gradients, where the loss diverges and the run is often unrecoverable.

Adaptive learning rates are a thing. For example, one scheme I've used before is to decrease the learning rate when the validation loss stops decreasing (often called reduce-on-plateau).

It's not clear to me if this is applicable to LLMs though.
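The plateau-based scheme described above can be sketched roughly like this. The class name `PlateauDecay` and the defaults for `factor`, `patience`, and `min_lr` are assumptions for illustration, not anything from a particular library; PyTorch ships a scheduler in the same spirit as `torch.optim.lr_scheduler.ReduceLROnPlateau`.

```python
class PlateauDecay:
    """Hypothetical sketch: cut the learning rate when validation loss stalls."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor        # multiplier applied on each cut
        self.patience = patience    # non-improving evals tolerated before a cut
        self.min_lr = min_lr        # floor for the learning rate
        self.best = float("inf")    # best validation loss seen so far
        self.bad_evals = 0          # consecutive evals without improvement

    def step(self, val_loss):
        """Record one validation result and return the (possibly reduced) lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_evals = 0
        return self.lr

# Example: the loss improves twice, then stalls for three evaluations.
sched = PlateauDecay(lr=0.1, factor=0.5, patience=2)
for loss in (1.0, 0.9, 0.95, 0.92, 0.91):
    print(loss, sched.step(loss))
```

Unlike a fixed schedule, this one reacts to what the run is actually doing, at the cost of needing a validation signal that is cheap and frequent enough to drive it.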