HN Mirror


by longtom 1955 days ago

tl;dr: Don't use batch norm to prevent exploding gradients; use adaptive gradient clipping thresholds instead.

For this they compute the Frobenius norm (square root of the sum of squares) of a layer's weights and of its gradient, and take the ratio of the two as the clipping threshold.

That saves the meta search for the optimal threshold, and it is also better than a fixed threshold could ever be.
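A rough sketch of how I read it, in plain Python so it runs without any libraries. Everything named here is my own assumption and not taken from the paper: the function names, the `max_ratio` multiplier, the `eps` floor for all-zero weights, and the choice to rescale the gradient when the ratio of gradient norm to weight norm gets too large. The only point is that the clipping limit scales with the weight norm instead of being a fixed number.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a 2-D list."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def clip_gradient_by_weight_ratio(weights, grad, max_ratio=0.01, eps=1e-3):
    """Rescale grad so that ||grad|| / ||weights|| never exceeds max_ratio.

    max_ratio and eps are hypothetical knobs chosen for illustration.
    """
    # Floor the weight norm so a zero-initialised layer can still receive updates.
    w_norm = max(frobenius_norm(weights), eps)
    g_norm = frobenius_norm(grad)

    # The limit on the gradient norm grows and shrinks with the weight norm.
    threshold = max_ratio * w_norm
    if g_norm <= threshold:
        return [row[:] for row in grad]  # small enough: return an unchanged copy

    # Too large: shrink the whole gradient, keeping its direction.
    scale = threshold / g_norm
    return [[x * scale for x in row] for row in grad]


if __name__ == "__main__":
    weights = [[3.0, 4.0]]  # Frobenius norm 5
    grad = [[0.6, 0.8]]     # Frobenius norm 1
    print(clip_gradient_by_weight_ratio(weights, grad, max_ratio=0.1))
```

With a fixed threshold the same gradient would be clipped identically whether the layer's weights were tiny or huge; here the allowed gradient size is tied to the size of the weights it updates.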

Very simple idea.

3 comments

sillysaurusx 1955 days ago

Thanks for macroexpanding frobnorm.

I'm skeptical that these hand-coded thresholds can ever match what a model can learn automatically. But it's hard to argue with results.


highfrequency 1955 days ago

Is this the first time that the gradient clip threshold has been chosen relative to the size of the weight matrix?


longtom 1955 days ago

Lol, why is my comment down here with 7 upvotes?
