by yobbo 1069 days ago
When gradients become auto-correlated, they are no longer zero-mean. In that case, vₜ can grow large and uₜ becomes very small. Maybe try centering the g² term in vₜ, i.e. use (gₜ-mₜ)² instead. Also, when restarting training, m and v are probably reset to zero.
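
To make the centering idea concrete, here is a rough sketch in plain NumPy, not taken from the paper or from any particular optimizer implementation. It tracks Adam-style moments for a scalar gradient stream and compares the usual g² accumulator with the centered (gₜ-mₜ)² one. The function name `adam_moments`, the hyperparameters, and the synthetic gradient stream (mean 1.0, small noise) are all assumptions for illustration, and I'm taking uₜ to mean the normalized step m̂ₜ/(√v̂ₜ + ε).

```python
import numpy as np

def adam_moments(grads, beta1=0.9, beta2=0.999, eps=1e-8, centered=False):
    """Run Adam-style moment estimates over a 1-D sequence of scalar gradients.

    Hypothetical helper for illustration only. With centered=True, the second
    moment accumulates (g_t - m_t)^2 instead of g_t^2.
    Returns the final m, the final v, and the per-step normalized update u_t.
    """
    m, v = 0.0, 0.0  # both start at zero, as they would after a reset
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        d = g - m if centered else g  # deviation from running mean vs raw gradient
        v = beta2 * v + (1 - beta2) * d * d
        m_hat = m / (1 - beta1 ** t)  # standard bias correction
        v_hat = v / (1 - beta2 ** t)
        steps.append(m_hat / (np.sqrt(v_hat) + eps))
    return m, v, np.array(steps)

# Synthetic gradients with a clearly non-zero mean (assumed, for illustration).
rng = np.random.default_rng(0)
grads = 1.0 + 0.1 * rng.standard_normal(2000)

_, v_plain, u_plain = adam_moments(grads)
_, v_centered, u_centered = adam_moments(grads, centered=True)

print("plain:    v =", v_plain, " last u =", u_plain[-1])
print("centered: v =", v_centered, " last u =", u_centered[-1])
```

With a non-zero-mean stream like this, the centered v only sees the spread around the running mean, so it stays much smaller than the plain v and the normalized step comes out correspondingly larger. Whether that is actually desirable in a real training run is a separate question.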

Might not be related to the phenomena in the paper.