| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by rememberlenny 3093 days ago

[Text from post]

OK, Deep Learning has outlived its usefulness as a buzz-phrase. Deep Learning est mort. Vive Differentiable Programming!

Yeah, Differentiable Programming is little more than a rebranding of the modern collection Deep Learning techniques, the same way Deep Learning was a rebranding of the modern incarnations of neural nets with more than two layers.

But the important point is that people are now building a new kind of software by assembling networks of parameterized functional blocks and by training them from examples using some form of gradient-based optimization.

An increasingly large number of people are defining the network procedurally in a data-dependant way (with loops and conditionals), allowing them to change dynamically as a function of the input data fed to them. It's really very much like a regular progam, except it's parameterized, automatically differentiated, and trainable/optimizable. Dynamic networks have become increasingly popular (particularly for NLP), thanks to deep learning frameworks that can handle them such as PyTorch and Chainer (note: our old deep learning framework Lush could handle a particular kind of dynamic nets called Graph Transformer Networks, back in 1994. It was needed for text recognition).

People are now actively working on compilers for imperative differentiable programming languages. This is a very exciting avenue for the development of learning-based AI.

Important note: this won't be sufficient to take us to "true" AI. Other concepts will be needed for that, such as what I used to call predictive learning and now decided to call Imputative Learning. More on this later....

2 comments

platz 3093 days ago

> It's really very much like a regular progam, except it's parameterized, automatically differentiated, and trainable/optimizable.

> People are now actively working on compilers for imperative differentiable programming languages.

Do you have an example of either of these things?

joshuamorton 3093 days ago

I believe https://github.com/google/tangent counts, though I'm not 100% sure.

zengid 3090 days ago

From the thread, LeCun mentions:

"Look at papers by Jeff Siskind and Barak Barak A. Pearlmutter particularly their work on VLAD, Stalingrad and VLD."

chombier 3093 days ago

Erik Meijer gave a talk about this at KotlinConf last year https://www.youtube.com/watch?v=NKeHrApPWlo

bra-ket 3093 days ago

it's really a pity that after 75 years of AI research the best thing we've got is still based on gradient descent, a brute force trial and error.

log_base_login 3093 days ago

As much of a pity that, 70 years later, we are still using transistor based computers originally derived from three wires stuck in a piece of rock[1] by some very innovative fellows at Bell Labs[2]?

[1]http://images.computerhistory.org/revonline/images/500004836... [2]http://www.computerhistory.org/revolution/digital-logic/12/2...

nostrademons 3093 days ago

As much of a pity as that after 3 billion years, all life is still based upon selection bias and random genetic mutations, a brute force trial and error?

Cybiote 3093 days ago

This is a poor example considering evolution is anything but brute force trial and error. Each organism can be considered either an hypothesis or a collection of hypotheses about the background environment (https://arxiv.org/pdf/0911.1763v3.pdf, https://aeon.co/essays/consciousness-is-not-a-thing-but-a-pr...).

The algorithms which allow AlphaGo or the Libratus Poker AIs to achieve superhuman play have a direct correspondence with natural selection (and learning rate with selection strength). There is idea sharing between evolutionary game theory and learning, up to algorithms to play extensive form games with imperfect information such as this, based on replicator dynamics: http://dl.acm.org/citation.cfm?id=2617448

Evolution, per genome trajectory, also improves in its ability to evolve, as seen in evolvability.

Returning to the grandparent's original lament on stochastic gradient descent, I do agree with them. I suspect the need for better has not been seen due to non-supervised and incremental learning remaining minor areas of study. Artificial separation between learning and prediction allows "hacks" like batch-norm to somewhat suffice. It does seem unlikely that we will never need to take into account curvature of information manifolds while learning. Note that evolution uses curvature too.

Anything order than SGD has proven expensive but there are promising approximations, as found in KFAC or Projected Natural Gradient Descent, allowing me to close this post with yet another link between learning and evolution.

mturmon 3093 days ago

Following a gradient is smarter than trial and error. You can make an argument that, in high-dimensional parameter spaces, it’s hard to do better (because, gradient descent is linear in the number of dimensions).

Ordinary Metropolis-Hastings, for example, is closer to trial and error.

tmyklebu 3093 days ago

> You can make an argument that, in high-dimensional parameter spaces, it’s hard to do better (because, gradient descent is linear in the number of dimensions).

Nitpick: If you're interested in solving optimisation problems, it's very easy to do better than gradient descent. Gradient descent performs very poorly when the directional curvature of the objective function varies too much with the direction. Newton's method, quasi-Newton methods, and nonlinear conjugate gradient are some of the more ingenious, beautiful, and clean ways; there are also some dirty, hacky ways to go.

It is a little bit interesting that fancier optimisation algorithms than gradient descent are unnecessary or unhelpful in some large-scale applications.

mturmon 3093 days ago

You mention Newton’s method, but of course that requires second order information which, as I mentioned, is not generally workable in high dimensions. You have to be careful with quasi-Newton methods like conjugate gradient for the same reason.

tmyklebu 3093 days ago

> You mention Newton’s method, but of course that requires second order information which, as I mentioned, is not generally workable in high dimensions.

Why would you say that second-order information is "not generally workable in high dimensions"? We regularly run Newton's method on problems with tens of millions of variables today.

And Newton's method isn't the only way to use second-order information. It is easy to access, for example, Hessian-times-vector information using the same reverse-mode differentiation that's so popular today, using only a constant factor more time.

> You have to be careful with quasi-Newton methods like conjugate gradient for the same reason.

What reason, exactly, is that?

mturmon 3092 days ago

I meant, in general settings (ie, no special problem structure), that you need full Hessians for Newton’s method. And, regarding conjugate gradient, in (non-DL) settings that I’m used to, that for good results you need preconditioners which are also second order.

Could you provide a reference to a 10^7 size problem that is being optimized with Newton’s method? I’d be indebted.

tomrod 3093 days ago

> because, gradient descent is linear in the number of dimensions

Yes, but increasing dimensions does not mean the manifold of the problem space increases the same.

jclos 3093 days ago

Its simplicity is its power. More complex methods (e.g. second order methods) tend to get attracted to saddle points and produce bad results. Some metaheuristics like evolution strategies are also used in some specific cases (reinforcement learning). Minibatch gradient descent + reasonable minibatch size + some form of momentum is the best we have.

electricslpnsld 3093 days ago

If you can’t reasonably get at or use second order information, how else are you going to optimize arbitrary objectives?

Well, come to think of it it, why don’t DL approaches use BFGS instead of gradient descent?

fwilliams 3093 days ago

There is literature on Quasi-Newton and Krylov Subspace methods for training Neural Networks. For example, https://dl.acm.org/citation.cfm?id=3104516.

I think the primary reason that such methods are not used much in practice is memory and computational cost: each function evaluation is expensive and you need to solve a very large system at every iteration.

Also to reply to a sibling comment, you can add momentum and step length adjustments to second-order methods in much the same way as in steepest-descent to help escape saddles. The only difference is how the descent direction is chosen for the optimization.

steev 3093 days ago

This is correct - second order methods are great in theory, but they are generally computationally prohibitive for high dimensional problems.

jph00 3093 days ago

Second order methods are attracted to saddle points in high dimensional spaces. The math and practice of optimizing these surfaces has a lot of nuances like this so much of the stuff you learn in your convex optimization class doesn't apply too well.

steev 3093 days ago

Do you have any recommendations on sources to read about this? Everything I've read discusses the use of the Hessian to not only determine you are at a saddle point but to also use its eigenvalues to escape.

plafl 3093 days ago

I have not used it but there is an implementation in PyTorch: http://pytorch.org/docs/master/optim.html#torch.optim.LBFGS

bra-ket 3090 days ago

the question is - why do you need to optimize in the first place? why don't you look up an answer instead of solving a mathematical optimization problem?

pacavaca 3093 days ago

Assuming that AI tries to mimic the way humans learn and evolve, those methods haven't changed for hundreds of thousands of years and brute-force trial and error is just one of them. It's kind of fundamental...

comstock 3093 days ago

It doesn’t really try to mimic the way humans learn and evolve...

maxtollenar 3093 days ago

Mostly because the loss function space is not well understood, we need to do some kind of descent