| A single linear layer is effectively an ensemble of linear estimators, one per output unit. Suppose we block gradients between two layers A and B and compute (B . f . A)(x), with f being a nonlinearity. The second layer is then an ensemble of linear estimators over the outputs of the first, and (f . A)(x) is just preprocessing for B. Since gradients don't flow from B to A, A is trained independently of B, so the distribution B is trained on changes over time without B influencing it, i.e. context drift. B cannot tell the difference. You could record the outputs of (f . A) at every step while training A to completion, then feed them into B afterwards, and B would compute the same outputs and derivatives as it did before. To deal with context drift, Hinton proposes normalizing the data so that the distribution does not change significantly. What he proposes is not "backprop-free" either: it still involves backprop, but gradients flow through only one layer, the layer itself. The argument that you can still train through non-differentiable operations is not particularly convincing either; the reparameterization trick shows that it is trivial to pass gradients through non-differentiable operations if we are smart about it. Given a non-differentiable operator Z: R^N -> R^N and linear layers A, B, C: R^N -> R^N, the network B(Z(C(x)) * A(C(x))) allows gradients to flow through B and A all the way to C. The output of Z enters only as one factor of a Hadamard product with (A . C)(x); it is constructed at runtime and might as well be part of the input. You can even run Z(C(x)) through a neural network and learn how to transform it, and still provide useful and informative gradients back to C(x) via (A . C). |
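
To make the gating construction B(Z(C(x)) * A(C(x))) concrete, here is a minimal sketch in plain Python. It rests on several assumptions that are not in the post: Z is a hard step function, the loss is the sum of the outputs, the 2x2 weights are arbitrary, and all helper names (`matvec`, `transpose`, `loss`, `grad_C`) are hypothetical. The gradient is written out by hand with Z(C(x)) treated as a constant gate, so nothing is differentiated through Z.

```python
# Hypothetical helpers for illustration; none of these names come from the post.

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def Z(v):
    """Assumed non-differentiable operator: a hard step (zero gradient almost everywhere)."""
    return [1.0 if x > 0 else 0.0 for x in v]

def loss(A, B, C, x):
    """L = sum(B(Z(Cx) * A(Cx))), where * is the Hadamard product."""
    c = matvec(C, x)
    gated = [z * a for z, a in zip(Z(c), matvec(A, c))]
    return sum(matvec(B, gated))

def grad_C(A, B, C, x):
    """dL/dC by hand, treating Z(Cx) as a constant gate (no gradient through Z)."""
    c = matvec(C, x)
    z = Z(c)
    d_gated = matvec(transpose(B), [1.0] * len(B))  # dL/dy = 1 because L = sum(y)
    d_a = [zi * g for zi, g in zip(z, d_gated)]     # back through the Hadamard product
    d_c = matvec(transpose(A), d_a)                 # back through A only
    return [[dci * xj for xj in x] for dci in d_c]  # outer product of d_c and x

# Arbitrary example weights and input (assumptions, chosen so Cx is away from 0).
A = [[1.0, 2.0], [0.5, -1.0]]
B = [[2.0, 0.0], [1.0, 3.0]]
C = [[1.0, -1.0], [0.5, 2.0]]
x = [1.0, 2.0]

grad = grad_C(A, B, C, x)
print(grad)
```

With these values every entry of the gradient with respect to C is nonzero, even though Z itself contributes no gradient: the signal reaches C entirely through the (A . C) path. A finite-difference check on `loss` agrees with `grad_C` as long as the perturbation is small enough not to flip any entry of the gate.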