by longtom 1955 days ago
tl;dr: To prevent exploding gradients, don't use batch norm; use adaptive gradient clipping thresholds instead.

For this, they compute the Frobenius norm (square root of the sum of squares) of each layer's weights and of its gradient, and use the ratio of the two to decide the clipping threshold.
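A rough sketch of how I read that, in plain Python. This is not the authors' code: the names `clip_factor` and `eps`, their default values, and the choice to clip per layer are all my assumptions for illustration. The idea is that a gradient is rescaled whenever its Frobenius norm exceeds a fixed fraction of the weight norm, so the effective threshold follows the size of the weights.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a 2-D list."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def clip_gradient(weights, grad, clip_factor=0.01, eps=1e-3):
    """Rescale grad so that ||grad|| <= clip_factor * max(||weights||, eps).

    clip_factor and eps are hypothetical parameters chosen for this sketch.
    eps keeps the threshold from collapsing to zero for all-zero weights.
    """
    w_norm = max(frobenius_norm(weights), eps)
    g_norm = frobenius_norm(grad)
    max_norm = clip_factor * w_norm  # threshold scales with the weight norm
    if g_norm <= max_norm:
        return grad  # small enough relative to the weights: leave untouched
    scale = max_norm / g_norm
    return [[x * scale for x in row] for row in grad]


if __name__ == "__main__":
    w = [[3.0, 4.0]]  # Frobenius norm 5
    g = [[0.3, 0.4]]  # Frobenius norm 0.5, ten times over the threshold
    print(clip_gradient(w, g))
```

The only knob left is the ratio `clip_factor`, which is relative to the weights rather than an absolute gradient magnitude.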

That saves the hyperparameter search for the optimal threshold, and it also works better than any fixed threshold could.

Very simple idea.

3 comments

Thanks for macroexpanding frobnorm.

I'm skeptical that these hand-coded thresholds can ever match what a model can learn automatically. But it's hard to argue with results.

Is this the first time that the gradient clip threshold has been chosen relative to the size of the weight matrix?

Lol, why is my comment down here with 7 upvotes?