by simonster 2948 days ago
There are a couple of factual errors here. First, the difference between backprop and evolution is smaller than the author indicates. The error signal used in modern backprop training is stochastic because it is computed on a minibatch (which is why it's called stochastic gradient descent). This stochasticity seems important to achieving good results. And the most popular evolutionary algorithm in the deep learning world is Evolution Strategies, which effectively approximates a gradient. Ordinary genetic algorithms are not gradient-based and have recently shown promise in limited domains, but can't compete with gradient-based algorithms for supervised learning.
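To make the "Evolution Strategies approximates a gradient" point concrete, here is a minimal sketch, not the algorithm from any particular paper. The toy quadratic `loss`, the antithetic (mirrored) sampling, and the values of `sigma` and `n_samples` are all my own assumptions for illustration. It only evaluates the loss at randomly perturbed points, never differentiates it, and still ends up close to the analytic gradient.

```python
import random

def loss(w):
    # Toy objective (an assumption for illustration): squared distance from the origin.
    return sum(x * x for x in w)

def true_gradient(w):
    # Analytic gradient of the toy objective, used only for comparison.
    return [2.0 * x for x in w]

def es_gradient(f, w, sigma=0.1, n_samples=20000, seed=0):
    """Estimate the gradient of f at w from function evaluations only.

    Uses Gaussian perturbations with antithetic pairs (w + sigma*eps, w - sigma*eps).
    """
    rng = random.Random(seed)
    grad = [0.0] * len(w)
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in w]
        w_plus = [x + sigma * e for x, e in zip(w, eps)]
        w_minus = [x - sigma * e for x, e in zip(w, eps)]
        diff = f(w_plus) - f(w_minus)
        # Weight each perturbation direction by how much it changed the loss.
        for i, e in enumerate(eps):
            grad[i] += diff * e
    return [g / (2.0 * sigma * n_samples) for g in grad]

w = [1.0, -2.0, 0.5]
print("analytic gradient:", true_gradient(w))
print("ES estimate:      ", es_gradient(loss, w))
```

The estimate is noisy, and the noise shrinks as `n_samples` grows, which is the same flavor of stochasticity you get from a minibatch gradient.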

The key claim in the article, that gradient descent could not discover physics from equations, seems like it is a statement about neural networks, not gradient descent. Given sufficient training data, a neural network can probably learn to model physics. I sympathize with the concern that it's very difficult to translate a neural network's knowledge into human concepts, but I see no reason to believe that optimizing the same system with an evolutionary algorithm would make this problem any easier. You could e.g. try to do program induction (which was supposed to be the future of AI many decades ago) instead of modeling the data directly, but choosing to perform program induction does not preclude the use of a neural network. Neural networks trained by gradient descent can generate ASTs (e.g. http://nlp.cs.berkeley.edu/pubs/Rabinovich-Stern-Klein_2017_...).

[Edited to remove reference to universal approximation; as comments point out, even if a neural network can approximate a function, it isn't guaranteed to be able to learn it. But I am reasonably confident that a neural network can learn Newton's second law.]

7 comments

Recently I was involved in calibrating a thermal infrared camera at work. Out of curiosity, a colleague tried to use machine learning and ended up with a model containing hundreds of parameters (weights). Yet it was no better than a trivial model using a Planck integral (based on a simple assumption about how things worked) and a linear regression (to account for systematic errors), 2 parameters in total. And the simple model completely ignored time dependencies by assuming thermal stabilization, which could be accounted for using a couple of extra parameters based on a typical solution of the heat transfer equation. Initially it was puzzling, as I thought that a Planck integral should be easily modeled with the basic blocks of ML models. But then I realized that the Planck function in our case was integrated over a complex profile of an infrared filter and may not be something that is easy to capture within ML.
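For readers who want to see what such a two-parameter model might look like, here is a rough sketch under stated assumptions. The Gaussian `filter_transmission`, the 7 to 14 µm integration range, and the synthetic gain and offset are all hypothetical stand-ins; a real calibration would use the measured filter profile and real camera readings. The Planck formula itself is the standard spectral radiance expression.

```python
import math

H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m/s
K_B = 1.380649e-23   # Boltzmann constant, J/K

def planck(wavelength_m, temp_k):
    """Blackbody spectral radiance B(lambda, T) in W / (m^2 sr m)."""
    x = H * C / (wavelength_m * K_B * temp_k)
    return 2.0 * H * C ** 2 / wavelength_m ** 5 / math.expm1(x)

def filter_transmission(wavelength_m):
    # Hypothetical Gaussian band-pass centered at 10 um (assumption, not a real filter).
    return math.exp(-0.5 * ((wavelength_m - 10e-6) / 0.5e-6) ** 2)

def band_radiance(temp_k, lo=7e-6, hi=14e-6, steps=700):
    # Trapezoidal integration of Planck radiance weighted by the filter profile.
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        lam = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * planck(lam, temp_k) * filter_transmission(lam)
    return total * h

def fit_line(xs, ys):
    # Ordinary least squares for y = gain * x + offset (the 2 parameters).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    gain = sxy / sxx
    return gain, my - gain * mx

temps = [280.0, 300.0, 320.0, 340.0, 360.0]
radiances = [band_radiance(t) for t in temps]
# Synthetic camera counts with a made-up gain and offset, standing in for measurements.
counts = [100.0 * r + 50.0 for r in radiances]
gain, offset = fit_line(radiances, counts)
print("gain:", gain, "offset:", offset)
```

All the physics lives in `band_radiance`; the regression only has to absorb a scale and an offset, which is why two parameters can be enough.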
ML is a blunt instrument in such situations. I think your example illustrates the point of the original article very nicely.
> Given sufficient data, according to the Universal Approximation Theorem, a neural network can learn to model physics.

The ability of a system of linked functions to approximate any continuous function seems rather far from the ability to "learn modern physics".

It would seem like knowing modern physics would involve symbolic calculations rather than just approximating the behavior of any system.

A lot of physics is functions with singularities. An ANN can only approximate these to a specified limit...

I want to see a neural network that correctly solves 3-SAT.

> Given sufficient data, according to the Universal Approximation Theorem, a neural network can learn to model physics.

It just says there are weights to approximate any function, not that you can actually learn the weights. Neural networks trivially can't learn how to approximate noncomputable functions to any accuracy, and there might be a lot of other functions that neural networks are terrible at actually learning.

I am waiting for this uneducated drivel of explaining NN performance by their 'universal function approximator property' to stop. There are tons of other schemes that are also universal approximators, and they were known before NNs were a thing. Why don't we use those? Why don't they work as well?

Learning from examples and generalizing is a much different problem from function approximation.

Maybe suggest using polynomials instead of neural networks next time that happens? :)
It's a fair point that the Universal Approximation Theorem does not guarantee that the weights can be learned. OTOH, the physical laws that the article states a neural network cannot discover are computable functions.
You need a stronger bound than this. They have to be possible to approximate given a specific network size, architecture, and activation functions. Calculating that (or good statistics that will say so approximately) is a hard problem... It is solvable for a bunch of activations in a layered perceptron, but try extending this to something more complex.
I had to make a few simplifications to spell out the differences clearly and avoid making the text infinitely long. It's true that most current gradient descent algorithms are stochastic because they are computed on minibatches, and that sophisticated evolution strategies approximate the gradient. I still think the differences are significant, in that evolution updates less often and the direction of the update is much less (if at all) dependent on the feedback.

Now, your point about to what extent this is really about neural networks is a good one. Could a network learn F=ma, even if we could not interpret it? Maybe. With the right data, represented the right way.
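As a toy illustration of "represented the right way", here is a sketch that is entirely my own construction and not a claim about what a real network would do. If the data is given in log space, F = ma becomes log F = p·log m + q·log a, and plain gradient descent on two exponents recovers p ≈ 1 and q ≈ 1. The data ranges, learning rate, and step count are arbitrary assumptions.

```python
import math
import random

def make_data(n=200, seed=0):
    # Synthetic, noise-free observations of F = m * a (assumed ranges for m and a).
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        m = rng.uniform(1.0, 10.0)
        a = rng.uniform(1.0, 10.0)
        rows.append((math.log(m), math.log(a), math.log(m * a)))
    return rows

def fit_exponents(rows, lr=0.05, steps=2000):
    # Model: log F = p * log m + q * log a, trained by full-batch gradient descent on MSE.
    p, q = 0.0, 0.0
    n = len(rows)
    for _ in range(steps):
        grad_p = grad_q = 0.0
        for log_m, log_a, log_f in rows:
            err = p * log_m + q * log_a - log_f
            grad_p += 2.0 * err * log_m / n
            grad_q += 2.0 * err * log_a / n
        p -= lr * grad_p
        q -= lr * grad_q
    return p, q

p, q = fit_exponents(make_data())
print("learned exponents: p =", p, ", q =", q)
```

The learned exponents are directly readable as "F is proportional to m and to a", but only because the representation was chosen by hand. In raw (non-log) coordinates a linear model could not express the product at all.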

>This stochasticity seems important to achieving good results.

No, it is not, and it may even be counterproductive, so to speak.

https://arxiv.org/pdf/1605.02026.pdf - page 8, figure 2(b). SGD-optimized neural networks stop learning at the accuracy at which whole-dataset methods start!

Also, please note that the figure I pointed to is about high-energy particle analysis. An SGD-trained NN cannot even distinguish particles with good precision, let alone discover physics.

Also, neural networks commonly use dropout regularization. In dropout you train only a fraction (typically 50%) of randomly selected neurons, effectively creating ensembles.
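A minimal sketch of what a dropout mask does to one layer's activations, assuming the common "inverted dropout" variant where surviving units are rescaled by 1/keep_prob. The function name and the example values are mine, for illustration only.

```python
import random

def dropout(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob and rescale the survivors.

    Rescaling by 1 / keep_prob keeps the expected activation unchanged
    (the inverted-dropout convention, assumed here).
    """
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
layer = [0.3, 1.2, -0.7, 0.9, 0.1, -1.5]
# Each call draws a fresh mask, so each training step sees a different sub-network.
for step in range(3):
    print("step", step, dropout(layer, 0.5, rng))
```

Every distinct mask picks out a different sub-network sharing the same weights, which is where the "ensemble" reading comes from.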

Gradient descent and evolutionary algorithms (and many other search algorithms) advance through the hypothesis space in incremental (stochastic) steps, and both are path dependent. How they generate and update their hypotheses, how big a step they take, how they represent their state, and how they apply randomness create a unique learning bias, but there is nothing fundamentally different.

> Given sufficient training data, a neural network can probably learn to model physics.

Maybe basic Newtonian physics, but I seriously doubt any ANN we've built to date could come up with QM or Relativity no matter how wonderfully massive and accurate the data was.

Looks to me like those required sophisticated conceptual understanding of the world in addition to leaps of the imagination and creative thought experiments.