	by yobbo 1069 days ago
	When gradients become auto-correlated, they are no longer zero-mean. In that case, vₜ can grow large and uₜ becomes very small. Maybe try centering the g² term in vₜ, i.e. use (gₜ - mₜ)² instead of gₜ². Also, when restarting training, m and v are probably reset to zero. Might not be related to the phenomena in the paper.
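
	To make the centering idea concrete, here is a minimal NumPy sketch, not anything taken from the paper. It runs the usual exponential moving averages over a stream of scalar gradients with a non-zero mean and tracks both the raw second moment (EMA of g²) and the centered one (EMA of (g - m)²). The function name `second_moments`, the β values, the synthetic gradient stream, and the omission of bias correction are all assumptions made for illustration.

```python
import numpy as np

def second_moments(grads, beta1=0.9, beta2=0.999):
    """Run Adam-style EMAs over a sequence of scalar gradients.

    Returns (v_raw, v_centered):
      v_raw      -- EMA of g_t^2, as in standard Adam
      v_centered -- EMA of (g_t - m_t)^2, the centered variant suggested above
    Bias correction is left out to keep the sketch short.
    """
    m = 0.0
    v_raw = 0.0
    v_centered = 0.0
    for g in grads:
        m = beta1 * m + (1 - beta1) * g                      # first moment
        v_raw = beta2 * v_raw + (1 - beta2) * g ** 2         # uncentered
        v_centered = beta2 * v_centered + (1 - beta2) * (g - m) ** 2
    return v_raw, v_centered

# Hypothetical gradient stream: mean 1.0, small noise, so clearly not zero-mean.
rng = np.random.default_rng(0)
grads = 1.0 + 0.1 * rng.standard_normal(5000)

v_raw, v_centered = second_moments(grads)
print("raw v:     ", v_raw)
print("centered v:", v_centered)
```

	With a stream like this, the raw estimate is dominated by the squared mean, while the centered estimate tracks only the spread around m, so it comes out far smaller. Whether that actually helps in the setting the paper describes is a separate question.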