by longtom
1955 days ago
tl;dr: Don't use batch norm to prevent exploding gradients; use adaptive gradient thresholds instead. For this they compute the Frobenius norm (square root of the sum of squares) of a layer's weights and of its gradient, and use the ratio of the two to decide how much to clip. That saves the meta search for an optimal fixed threshold, and it also does better than a fixed threshold ever could. Very simple idea.

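A rough sketch of how I read the clipping rule, in NumPy. The names `adaptive_clip`, `clip_ratio` and `eps` are mine, not from the paper, and the default values are placeholders. This version takes the norm over the whole layer as described above; the actual method may apply it at a finer granularity.

```python
import numpy as np

def adaptive_clip(weight, grad, clip_ratio=0.01, eps=1e-3):
    """Rescale grad so that ||grad|| / ||weight|| never exceeds clip_ratio.

    clip_ratio and eps are assumed hyperparameters, chosen for illustration.
    """
    # np.linalg.norm on a matrix defaults to the Frobenius norm.
    # eps keeps zero-initialised weights from freezing the layer.
    w_norm = max(np.linalg.norm(weight), eps)
    g_norm = np.linalg.norm(grad)

    # The effective threshold scales with the size of the weights.
    max_norm = clip_ratio * w_norm
    if g_norm > max_norm:
        return grad * (max_norm / g_norm)
    return grad

# Usage: a large gradient is scaled down, a small one passes through.
w = np.ones((2, 2))               # Frobenius norm 2
g_big = np.full((2, 2), 10.0)     # Frobenius norm 20
g_small = np.full((2, 2), 0.01)   # Frobenius norm 0.02
clipped = adaptive_clip(w, g_big, clip_ratio=0.5)
untouched = adaptive_clip(w, g_small, clip_ratio=0.5)
```

The point is that the threshold follows the weight norm, so a layer with large weights tolerates larger gradients than a layer with small ones.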
I'm skeptical that these hand-coded thresholds can ever match what a model can learn automatically. But it's hard to argue with results.