| HN Mirror

	by aarondefazio 810 days ago
	The behavior is actually more complex than a 1/t schedule. It behaves like a linear decay schedule 1-t/T with fixed stopping time T, as if T had been chosen in advance as the current timestep. When warmup is included, this is similar to high-performance triangular learning rate schedules. Schedules of the form 1/t perform really poorly in practice; we actually did a large-scale comparison that included them in a prior paper: https://arxiv.org/pdf/2310.07831.pdf
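
	To make the two shapes concrete, here is a rough sketch of a 1/t schedule next to a linear decay 1-t/T with optional warmup (the triangular shape). The function names, `base_lr`, and `warmup_steps` are made up for illustration; this only shows the explicit schedules being compared, not the method discussed in the thread or the code from the linked paper.

```python
def inverse_time_lr(t, base_lr=1.0):
    """1/t-style schedule: base_lr / t, defined for steps t >= 1."""
    if t < 1:
        raise ValueError("t must be >= 1 for a 1/t schedule")
    return base_lr / t


def linear_decay_lr(t, T, base_lr=1.0, warmup_steps=0):
    """Linear warmup followed by linear decay to zero at the stopping time T.

    With warmup_steps=0 this reduces to base_lr * (1 - t/T).
    With warmup_steps > 0 the shape is triangular: ramp up, then ramp down.
    """
    if not 0 <= t <= T:
        raise ValueError("t must lie in [0, T]")
    if not 0 <= warmup_steps < T:
        raise ValueError("warmup_steps must lie in [0, T)")
    if t < warmup_steps:
        # Warmup phase: rise linearly from 0 toward base_lr.
        return base_lr * t / warmup_steps
    # Decay phase: fall linearly from base_lr to 0 at step T.
    return base_lr * (T - t) / (T - warmup_steps)


if __name__ == "__main__":
    T = 100
    for t in (1, 10, 25, 50, 75, 100):
        print(
            t,
            round(inverse_time_lr(t), 4),
            round(linear_decay_lr(t, T, warmup_steps=10), 4),
        )
```

	Note how the 1/t schedule has already dropped to a small fraction of `base_lr` after a handful of steps, while the linear decay keeps the step size large for most of the run and only reaches zero at T.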