| HN Mirror


	by minimaxir 902 days ago
	Constant learning rates were the default in older ML implementations, but linear decay became an obvious optimization, and now we have both warmup and cosine decay to handle common training patterns, especially with the AdamW optimizer. If the learning rate is too high at a given point in training, either a) the model stops learning or b) the gradients explode, which is very bad.
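
	For anyone who wants to see what warmup plus cosine decay looks like in practice, here is a rough sketch in plain Python. The function name `lr_at_step` and all the default values (`max_lr`, `min_lr`, `warmup_steps`, `total_steps`) are made up for illustration and are not taken from any particular library or paper; real frameworks ship their own schedulers with slightly different conventions.

```python
import math


def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr.

    All names and default values here are hypothetical, chosen only
    to make the shape of the schedule easy to see.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        # After the schedule ends, hold at the floor.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    # Cosine factor goes smoothly from 1 down to 0.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


if __name__ == "__main__":
    for s in (0, 50, 99, 100, 550, 999, 1000):
        print(s, lr_at_step(s))
```

	The warmup keeps the rate small while the optimizer's early statistics are still noisy, and the cosine tail lowers it gradually, so the rate is less likely to be too high late in training. The value returned for each step would then be assigned to the optimizer before that step.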