by prideout 38 days ago
This is a fascinating mathematical framework, but the post title might be a bit of an overreach. I often wonder if "a theory of deep learning" could exist that could be stated succinctly and that could predict (1) scaling laws and (2) the surprising reliability of gradient descent.

Note that I said "predict" not "describe". It feels like we're still in the era of Kepler, not Newton.

2 comments

I dunno... gradient descent is only really reliable with a big bag of tricks. Knowing good initializations is a starting point, but residual connections and batch/layer normalization go a very long way towards making it reliable.
I agree, this is the correct way to see it IMO. Instead of designing better optimizers, we designed easier parameterizations to optimize. The surprising part is that these parameterizations exist in the first place.
Gradient descent is mathematically the most efficient optimization strategy (save for some special functions) in high dimensions. This goes so far that people nowadays even believe it has to be used in the human brain [1], if only because every other method of updating the brain would be way too energy inefficient. From that perspective, finding the right parameterization was all we ever needed to achieve AI.

[1] https://physoc.onlinelibrary.wiley.com/doi/full/10.1113/JP28...

Even in supervised ML, pure gradient descent is not the most efficient optimization strategy. E.g., momentum is ubiquitous, and the updates it induces cannot be expressed as a gradient of some scalar loss. But the rotational non-gradient component of its updates substantially improves performance and convergence on the architectures we use.
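For readers who want the momentum point made concrete, here is a minimal sketch of the classic heavy-ball update next to plain gradient descent. The quadratic, the learning rate, and names like `CURVATURE` and `run_descent` are made up for illustration; it only shows the convergence difference on one toy problem, not the claim about non-gradient components.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: loss(x) = 0.5 * sum(CURVATURE * x**2)
CURVATURE = np.array([1.0, 100.0])

def quadratic_loss(x):
    return 0.5 * float(np.sum(CURVATURE * x**2))

def quadratic_grad(x):
    return CURVATURE * x

def run_descent(beta, lr=0.01, steps=100):
    """Heavy-ball momentum; beta=0.0 reduces to plain gradient descent."""
    x = np.array([1.0, 1.0])
    v = np.zeros_like(x)
    for _ in range(steps):
        # The velocity is a decaying sum of past gradients, so the step
        # direction is generally not parallel to the current gradient.
        v = beta * v - lr * quadratic_grad(x)
        x = x + v
    return quadratic_loss(x)

gd_loss = run_descent(beta=0.0)
momentum_loss = run_descent(beta=0.9)
print(f"plain GD: {gd_loss:.2e}  momentum: {momentum_loss:.2e}")
```

On this toy problem the momentum run ends at a much lower loss after the same number of steps, because plain gradient descent is stuck crawling along the low-curvature direction.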

The brain probably primarily uses something like TD for task learning, which is also not expressible as a gradient of any objective function. And, though the paper mentions Hebbian learning, it's only for very particular network architectures (e.g. single neuron; symmetric connections) that you can treat its updates as a gradient of some energy function; these architectures aren't anything close to what we see in the brain.
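One way to see the TD point, as a sketch under assumptions: for tabular TD(0) the expected update is a linear field `b - A @ w`, and a linear field is the gradient of some scalar function only if its Jacobian is symmetric. The 3-state chain and the helper `expected_td0_matrix` below are hypothetical, and this says nothing about what the brain actually does.

```python
import numpy as np

def expected_td0_matrix(P, d, gamma):
    """Matrix A in the expected tabular TD(0) update E[dw] = alpha * (b - A @ w).

    P: state transition matrix, d: state visitation distribution.
    """
    D = np.diag(d)
    return D @ (np.eye(len(d)) - gamma * P)

# Hypothetical 3-state chain that mostly cycles 0 -> 1 -> 2 -> 0.
P = np.array([[0.1, 0.9, 0.0],
              [0.0, 0.1, 0.9],
              [0.9, 0.0, 0.1]])
# P is doubly stochastic, so the uniform distribution is stationary.
d = np.full(3, 1.0 / 3.0)

A = expected_td0_matrix(P, d, gamma=0.9)

# The Jacobian of the update field is -A. If A is not symmetric, there is
# no scalar objective whose gradient produces this update.
is_gradient_field = bool(np.allclose(A, A.T))
print("TD(0) expected update is a gradient field:", is_gradient_field)
```

For reversible chains (for example any 2-state chain) `A` comes out symmetric, which is why the example uses a chain with a preferred direction of travel.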

Pure gradient descent is not what happens in either field, but e.g. momentum is just another parameter constructed from historic gradients. While it is unlikely that the brain runs backpropagation the way you see it implemented in modern ML (same goes for TD btw), the core principle kind of needs to be the same from a purely large-scale, high-dimensional network efficiency POV. On top of that, adaptive plasticity is almost by definition about estimating useful directions of change. The key insight here would be that the brain does gradient estimation quite cheaply and we can probably still learn a thing or two about modern ML from it.
Taking a quick look at the paper...

Their claim isn't that the brain uses gradient descent, but that the direction of updates has (on average) positive inner product with the gradient. I expect this would also be true for (say) simulated annealing, yet we don't say that simulated annealing is gradient descent.

Also missing is a discussion of loss functions and how they relate to the update. As far as I know, there's still no great notion of how the brain picks a global loss function, and no mechanism for backprop. In this paper, looking at a specific learning task, you can define a loss function extrinsically, which lets us talk about the gradient, but how that relates to things happening in the brain is a big big mystery.

Why would this be true for simulated annealing?
Because it improves the loss!

The (negative) gradient is the direction in which the loss improves the fastest. Moving in a direction with a positive dot product with it just means that you're (locally) improving the loss.
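A quick numerical check of this, as a rough sketch: propose random steps at a fixed point, accept them with the Metropolis rule used in simulated annealing, and average the inner product of the accepted steps with the descent direction. The toy loss, temperature, step size, and the function `mean_descent_alignment` are all assumptions chosen for illustration.

```python
import numpy as np

def toy_loss(x):
    return 0.5 * float(x @ x)

def toy_grad(x):
    return x

def mean_descent_alignment(x, temperature=0.1, step=0.1,
                           n_proposals=5000, seed=0):
    """Average of (-grad . d) over steps d accepted by the Metropolis rule."""
    rng = np.random.default_rng(seed)
    g = toy_grad(x)
    alignments = []
    for _ in range(n_proposals):
        d = step * rng.standard_normal(x.shape)  # gradient-free random proposal
        delta = toy_loss(x + d) - toy_loss(x)
        # Always accept improvements; accept worse moves with prob exp(-delta/T).
        if delta <= 0 or rng.random() < np.exp(-delta / temperature):
            alignments.append(float(-g @ d))
    return float(np.mean(alignments))

alignment = mean_descent_alignment(np.ones(10))
print("mean inner product with descent direction:", alignment)
```

The proposals never look at the gradient, yet the accepted ones line up with the descent direction on average, simply because improving moves are accepted more often than worsening ones.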

Hmm, I'm not sure what you mean by "Gradient descent is mathematically the most efficient optimization strategy". Do you mean gradient-based optimization in general? (In other words, do you consider Adam gradient descent?)