| > we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. I am surprised and a bit disappointed that this paper does not mention mean field theory or dynamical isometry at all. Mean field theory applies methods used in physics, namely random matrix theory and free probability, to derive an analytical description of how information flows through a neural network at initialization. It turns out that simply initializing the weights of a plain CNN with a delta-orthogonal kernel allows all frequency components (Fourier modes) to propagate through the network with minimal attenuation. Specifically, networks train well when their input-output Jacobians exhibit dynamical isometry, namely the property that the entire distribution of singular values is close to 1. This technique effectively solves the exploding/vanishing gradient problem. The impact is shocking: the time to train a NN to a given accuracy becomes independent of network depth. No tricks like batch normalization, dropout, or anything else are needed. This insight has been demonstrated for a wide range of architectures, from plain FFNs to CNNs, RNNs, and even transformers. I highly recommend reading the papers “Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks” [1] and “ReZero is All You Need: Fast Convergence at Large Depth” [2]. [1] https://arxiv.org/abs/1806.05393 [2] https://proceedings.mlr.press/v161/bachlechner21a/bachlechne... |
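
To make the dynamical isometry point concrete, here is a minimal NumPy sketch, not code from either paper. It makes several simplifying assumptions: the network is purely linear (the papers also tune the nonlinearity and weight variance), input and output channel counts are equal, and the kernel layout is (height, width, in, out). The helper names `random_orthogonal`, `delta_orthogonal_kernel`, and `jacobian_singular_values` are made up for illustration. It builds a delta-orthogonal kernel and compares the Jacobian singular values of a deep stack of orthogonal layers against a stack of scaled Gaussian layers.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Flip column signs using diag(r) so the sample is uniform over orthogonal matrices.
    return q * np.sign(np.diag(r))

def delta_orthogonal_kernel(k, channels, rng):
    """k x k conv kernel: zero everywhere except an orthogonal matrix at the spatial center.

    Assumed layout: (height, width, in_channels, out_channels), with in == out.
    """
    kernel = np.zeros((k, k, channels, channels))
    kernel[k // 2, k // 2] = random_orthogonal(channels, rng)
    return kernel

def jacobian_singular_values(weights):
    """Singular values of the input-output Jacobian of a deep *linear* network.

    With no nonlinearity, the Jacobian is just the product of the layer matrices.
    """
    jac = np.eye(weights[0].shape[1])
    for w in weights:
        jac = w @ jac
    return np.linalg.svd(jac, compute_uv=False)

rng = np.random.default_rng(0)
width, depth = 64, 50

# Orthogonal layers versus Gaussian layers with variance 1/width.
orth_layers = [random_orthogonal(width, rng) for _ in range(depth)]
gauss_layers = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]

sv_orth = jacobian_singular_values(orth_layers)
sv_gauss = jacobian_singular_values(gauss_layers)

print("orthogonal: min/max singular value", sv_orth.min(), sv_orth.max())
print("gaussian:   min/max singular value", sv_gauss.min(), sv_gauss.max())
```

In this linear toy case the orthogonal stack keeps every singular value at 1 regardless of depth, while the Gaussian stack's spectrum spreads over many orders of magnitude as depth grows. That spread is the ill-conditioning the comment above says dynamical isometry avoids.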