Hacker News new | ask | show | jobs
by pretty_dumm_guy 2006 days ago
Hi Professor,

Good day,

I was wondering whether it is be possible for you to provide an overview of different methods that you think might have a better shot at replacing backpropagation algorithm?

1 comments

Sure. First of all, I want to say that backprop, by which I mean reverse-mode differentiation for computing gradients, combined with gradient descent for updating parameters, is pretty great. In a sense it's the last thing we should be trying to replace, since pretty much the whole deep learning revolution was about replacing hand-designed functions with ones that can be optimized in this way.

Reverse-mode differentiation has about the same time cost as whatever function you're optimizing, no matter how many parameters you need gradients for. This which is about as good as one could hope for, and is what lets it scale to billions of parameters.

The main downside of reverse-mode differentiation (and one of the reasons it's biologically implausible) is that it requires storing all the intermediate numbers that were computed when evaluating the function on the forward pass. So its memory cost grows with the complexity of the function being optimized.

So the main practical problem with reverse-mode differentiation + gradient descent is the memory requirement, and much of the research presented in the workshop is about ways to get around this. A few of the major approaches are:

1) Only storing a subset of the forward activations, to get noisier gradients at less memory cost. This is what the "Randomized Automatic Differentiation" paper does. You can also save memory and get exact gradients if you re-construct the activations as you need them (called checkpointing), but this is slower.

2) Only training one layer at a time. This is what the "Layer-wise Learning" papers are doing. I suppose you could also say that this is what the "feedback alignment" papers are doing.

3) If the function being optimized is a fixed-point computation (such as an optimization), you can compute its gradient without needing to store any activations by using the implicit function theorem. This is what my talk was about.

4) Some other forms of sensitivity analysis (not exactly the same as computing gradients) can be done by just letting a dynamical system run for a little while. Barak Pearlmutter has some work on how he thinks this is what happens in slow-wave sleep to make our brains less prone to seizures when we're awake.

I'm missing a lot of relevant work, and again I don't even know all the work that was presented at this one workshop. But I hope this helps.

> Barak Pearlmutter has some work on how he thinks this is what happens in slow-wave sleep to make our brains less prone to seizures when we're awake.

Interesting! I am more familiar with Pearlmutter's work on automatic differentiation, but was was unaware of this work with Houghton.

A new hypothesis for sleep: tuning for criticality: https://zero.sci-hub.se/2153/6c1cfbc1b78d23ef2e1cb7102dd8339...

There is also a related paper on wake-sleep learning from UofT, of which I am sure you are aware:

The wake-sleep algorithm for unsupervised neural networks: https://www.cs.toronto.edu/~hinton/absps/ws.pdf

Are you aware of any recent work investigating the role of sleep in biological and statistical learning?

Yes, "tuning for criticality" was the paper I was thinking of ! But I'm afraid I'm a dilettante when it comes to neuroscience. I basically just know the basic theories about consolidating learning during sleep.
Thank you for your answer. It appears to me that we are trying to achieve an algorithm that has better time complexity than the one that we have right now(reverse mode differentiation with gradient descent).

Is it possible to combine these methods in a straight forward manner with methods that try to reduce the space complexity? For example, Lottery ticket hypothesis(https://arxiv.org/abs/1803.03635) seems to reduce spacial complexity(Please do correct me if I am wrong).

Also, based on my rather poor and limited knowledge, it appears to me that set of proposed methods that reduced space complexity and set of proposed methods that reduce time complexity are disjoint. Is that the case ?

Thanks for your question! But as I said, no one is really worried about the asymptotic time complexity of reverse mode differentiation, although there is scope for improving constants). The main scope for improvement is in the space complexity.

There is a lot of work on trying to speed up optimization, for example the K-FAC optimizer by Roger Grosse that uses second-order gradient information in a scalable way.

The lottery ticket pruning strategies do reduce space complexity, but I think the main reason people are interested in it is to reduce training time complexity, or deployment memory requirements, but not so much training memory requirements.

As for whether memory-saving and time-saving approaches are disjoint, many methods (like checkpointing) introduce a tradeoff between time and space complexity, so no.

Thank you again for the clarifications. You have given me something to chew on over the holidays.

I wish you and your family a happy Christmas :)

(Lottery Ticket, to date, produces small networks ex post facto... You still have to train the giant network. There's also some indication that it's chancy on 'large' datasets+problems. https://arxiv.org/abs/1902.09574 )
There's also RevNets, which saves memory by only using reversible layers: https://arxiv.org/abs/1707.04585
Yes, good point!