by ekleraki 1253 days ago
A single linear layer is, for all intents and purposes, an ensemble of linear estimators. If you stop gradients from flowing between two layers A and B and compute (B . f . A)(x), with f a nonlinearity, then the second layer is just an ensemble of linear estimators over the outputs of the first, and (f . A)(x) is nothing more than preprocessing for B.

Since gradients don't flow from B to A in (B . f . A)(x), A is trained independently of B. That means B's training distribution changes over time without B having any influence on it, i.e. context drift. B can't tell the difference and can't do anything about it.

In practice, you can record A's outputs at every step while training A to completion, then feed those recorded outputs into B afterwards, and B will compute exactly the same outputs and derivatives as it would have if it had been trained alongside A.
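To make the replay argument concrete, here is a minimal sketch with scalar "layers". The local loss for A, the learning rates and the toy data are all my own assumptions for illustration, not anything from the paper. The only point is that B's updates depend on A solely through the sequence of outputs A produced.

```python
def relu(v):
    return v if v > 0.0 else 0.0

def step_a(w_a, x, lr=0.1):
    """One update of A on a hypothetical local loss (relu(w_a * x) - 1)^2."""
    pre = w_a * x
    h = relu(pre)
    grad = 2.0 * (h - 1.0) * (x if pre > 0.0 else 0.0)
    return w_a - lr * grad, h  # h is the output computed before the update

def step_b(w_b, h, target, lr=0.1):
    """One update of B on (w_b * h - target)^2, treating h as a constant input."""
    grad = 2.0 * (w_b * h - target) * h
    return w_b - lr * grad

# Toy (input, target) pairs, made up for the example.
data = [(0.5, 1.0), (1.5, -1.0), (2.0, 0.5), (1.0, 2.0)] * 5

# (1) Interleaved: A and B updated in the same loop, no gradient from B to A.
w_a, w_b = 0.3, 0.0
for x, t in data:
    w_a, h = step_a(w_a, x)
    w_b = step_b(w_b, h, t)
w_b_interleaved = w_b

# (2) Replay: train A to completion while recording its outputs, then train B.
w_a, recorded = 0.3, []
for x, t in data:
    w_a, h = step_a(w_a, x)
    recorded.append((h, t))
w_b = 0.0
for h, t in recorded:
    w_b = step_b(w_b, h, t)
w_b_replay = w_b

print(w_b_interleaved == w_b_replay)
```

Both schedules feed B the same inputs in the same order, so B ends up with identical weights either way.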

To deal with context drift, Hinton proposes normalizing the data, so the distribution does not change significantly.
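A rough sketch of what normalization buys you, assuming it means rescaling A's output vector to unit L2 length before B sees it (the helper names and the numbers are mine). If A's weights grow by a constant factor during training, the raw activations grow with them, but the normalized ones do not.

```python
import math

def layer_a(weights, x):
    """ReLU layer: weights is a list of rows, x is a list of floats."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def l2_normalize(h, eps=1e-8):
    """Rescale h to (approximately) unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(v * v for v in h))
    return [v / (norm + eps) for v in h]

x = [1.0, -2.0, 0.5]
# Hypothetical weights of A early in training...
w_early = [[0.2, -0.1, 0.4], [0.3, -0.1, 0.2], [-0.5, -0.3, 0.1]]
# ...and later, with the same directions but 10x the scale.
w_late = [[10.0 * w for w in row] for row in w_early]

h_early, h_late = layer_a(w_early, x), layer_a(w_late, x)
n_early, n_late = l2_normalize(h_early), l2_normalize(h_late)

print(h_early, h_late)   # raw outputs differ by the scale factor
print(n_early, n_late)   # normalized outputs agree up to rounding
```

Note this only removes drift in scale. If the direction of A's outputs changes, B's input distribution still moves.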

What he proposes is not "backprop-free" either. It still involves backprop; the gradients just flow through exactly one layer, the layer being trained.

The argument that you can still train through non-differentiable operations is not particularly convincing either; the reparameterization trick shows that it is trivial to pass gradients through non-differentiable operations if we are smart about it.

Given a non-differentiable operator Z: R^N -> R^N, let A, B, C be linear layers R^N -> R^N. Then B(Z(C(x)) * A(C(x))) allows gradients to flow through B and A all the way to C. The output of Z is effectively a runtime-constructed factor in a Hadamard product with (A . C)(x), and might as well be part of the input.

You can even run Z(C(x)) through a neural network and learn how to transform that, and still provide useful and informative gradients back to C(x) via (A . C).
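Here is a minimal sketch of the gating construction with N = 1, so every layer is a single scalar weight. Rounding stands in for Z, and the weight values are arbitrary choices of mine. It compares B(Z(C(x))), where the only path to C goes through Z, with B(Z(C(x)) * A(C(x))), using finite differences as a check.

```python
def Z(v):
    """Non-differentiable operator: rounding has zero gradient almost everywhere."""
    return float(round(v))

def plain(w_c, x, w_b=0.7):
    """B(Z(C(x))): the only path back to w_c goes through Z."""
    return w_b * Z(w_c * x)

def gated(w_c, x, w_a=1.3, w_b=0.7):
    """B(Z(C(x)) * A(C(x))): Z's output multiplies a differentiable branch."""
    c = w_c * x
    return w_b * (Z(c) * (w_a * c))

def gated_grad(w_c, x, w_a=1.3, w_b=0.7):
    """d gated / d w_c with Z(c) held constant (its own derivative is zero a.e.)."""
    return w_b * Z(w_c * x) * w_a * x

def finite_diff(f, w_c, x, eps=1e-6):
    """Central finite-difference estimate of df/dw_c."""
    return (f(w_c + eps, x) - f(w_c - eps, x)) / (2.0 * eps)

w_c, x = 0.8, 3.0  # C(x) is about 2.4, so Z(C(x)) = 2.0

print(finite_diff(plain, w_c, x))                       # nothing reaches w_c
print(finite_diff(gated, w_c, x), gated_grad(w_c, x))   # the two should agree
```

The gradient reaching C comes entirely from the A branch, scaled by whatever Z produced, which is why Z's output behaves like an extra input.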

1 comment

I'm not sure what the main point is here. The paper is definitely sketchy on details, and the main idea is definitely simple enough to resemble a lot of other work. I wouldn't be surprised if someone (maybe a certain Swiss researcher) comes out and says, actually, this is the same as this other paper from the early 90s. If you squint hard enough, a lot of ideas (especially simple ones) can be seen as being the same as other, older ideas. I'm not too interested in splitting those hairs, really. I'm more curious about whether this eventually leads to something that sets it apart from the SOTA in some interesting way.
My claim is that this work is simply worse ensembles wrapped in biologically inspired claims, and that the arguments the author makes for it over other approaches are simply not sound.

Looked at from that perspective, the issues with the approach become evident, and in my opinion they are fundamental.