by yobbo
1069 days ago
When gradients become auto-correlated, they are not zero mean. In that case, vₜ can grow large and uₜ becomes very small. Maybe try centering the g² term in vₜ, like (gₜ-mₜ)². Also, when restarting training, m and v are probably reset to zero. Might not be related to the phenomena in the paper.
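
A rough sketch of the centering idea, assuming the usual Adam update with bias correction. The function names, the scalar setup, and the hyperparameter defaults are illustrative assumptions and are not taken from the paper:

```python
import math

def adam_step(m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, centered=False):
    """One scalar Adam-style step (hypothetical helper); returns (u, m, v)."""
    m = beta1 * m + (1 - beta1) * g
    # Plain Adam tracks the raw second moment g^2. The centered variant
    # tracks the squared deviation from the running mean, (g - m)^2.
    sq = (g - m) ** 2 if centered else g ** 2
    v = beta2 * v + (1 - beta2) * sq
    m_hat = m / (1 - beta1 ** t)  # bias correction for zero-initialized m
    v_hat = v / (1 - beta2 ** t)  # bias correction for zero-initialized v
    u = lr * m_hat / (math.sqrt(v_hat) + eps)
    return u, m, v

def run(grads, centered):
    """Feed a gradient sequence through adam_step, starting from m = v = 0."""
    m = v = u = 0.0
    for t, g in enumerate(grads, start=1):
        u, m, v = adam_step(m, v, g, t, centered=centered)
    return u, m, v

# Extreme case of correlated gradients: the same value at every step.
grads = [1.0] * 200
u_plain, _, v_plain = run(grads, centered=False)
u_centered, _, v_centered = run(grads, centered=True)
print(u_plain, u_centered)
```

With a constant gradient, the centered vₜ decays because gₜ-mₜ shrinks as mₜ catches up, so the centered step ends up larger than the plain one. Starting `run` from m = v = 0 also mimics what a restart would do if the optimizer state is not restored.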