Hacker News new | ask | show | jobs
by tlb 2951 days ago
I'm optimistic about the potential for evolutionary algorithms. I've used both EAs and gradient descent in developing robot controllers.

But the argument here about why gradient descent won't be able to learn certain things is weak. Thought experiments are not a reliable guide to what what GD can or can't do.

It's fair enough to say that F=ma and E=mc² aren't in the data. Indeed, it took thousands of years of human thought to arrive at them. So the argument "it's not clear how an algorithm could extract F=ma from the data" isn't a strong criticism, because humans also can't do it by induction.

The long process culminating in F=ma involved a lot of abstract symbolic thought. Whether human-level abstract symbolic thought can be learned through GD (probably in combination with some sort of Monte Carlo tree search) is an open question. It can only be answered by trying to build things and seeing if they work.

If you want to make an argument about the limits of GD and induction, it'd be better to compare to a problem humans can solve reliably, rather than an insight that one genius had after decades of thought while standing on the shoulders of other geniuses.

4 comments

The author of the article mistook the importance of environment interactivity, considering that the learning algorithm is the key difference.

I think the distinction here should be between dataset based learning and simulator based learning. The genetic algorithms mentioned in the article rely on a dynamic environment, not a static dataset. Given the dynamic environment (which is like an infinite dataset) gradient methods can learn just as well - look at AlphaGo for example. But when the model can't experiment / try new actions and see the effects, it can't separate causes from correlations.

You can extract only so much from a dataset, the model needs a way to cause and observe external effects. The environment could be the real world, a simulated world, a game, a meta neural net optimiser (AutoML), or any domain where the model can act and influence the path of learning and the environment by its previous actions.

I'm happy to see the boom in RL and simulator based learning in the last few years. It means we are on the right track.

Right, I think it's very easy to fall into a trap of thinking that something is too hard for ML methods, because the way we solve them is different than how computers can solve them.

There is ongoing work to try to see if we can learn NN-based controllers that do their own learning or thinking. The Neural Turing Machine work being the most famous, but there's also some amount of single-shot/few-shot learning literature that is exploring other ways of learning because trying to learn from examples via SGD is too slow.

EAs are certainly interesting and I think are show some interesting results for hyperparameter tuning/architecture search of neural nets where we don't have gradients, and doing it with far less computation than RL solutions.

I don't understand how E=mc^2 can not be in the data. If it's a universal law, isn't it in more (all) data than any pattern that isn't universal?
The article is making a simpler point than that. If I show you the table:

  A	B	C
  1	4	3
  20	45	15
  8	15	7
And so on for some arbitrary number of rows, you can look at the table all you want but you will not perceive "A+C=B". It's just not written there. To get A+C=B you have to generate something else in addition to the table, namely a hypothesis- but this is a creative act, not an empirical one.
In this case, any linear model (with reasonable, minimizable loss, which could be, say convex) will learn the correct thing.

Here's a quick gist[0] doing it using least-squares, and learning it exactly (also, for B in the third row you may have intended 35 instead of 45?).

This simple regression model learns exactly the weights (-1, 1)—equivalently, it learns -A + B = C.

------

[0] https://gist.github.com/guillean/6f3ff05fa99b2b377fcf309bdc4...

If you connect A and B as the input to a linear neural net, and train against C, it'll very quickly arrive at weights of [-1, +1] and be able to correctly predict C given A and B. Whether or not it represents it in notation humans are familiar with, it has learned it for the practical purpose of being able to compute the function.
But how would a neural network know to connect data for mass with data for the measured speed of light? Why would a neural network be looking for an equation for energy conversion in the first place? If you just provide tons of raw data from instruments, what does that mean? What do yo do with it?

Sure, a human can clean the data and put it into a format that gives meaningful results. But if we're just talking about an AI learning from raw data with no supervision, where does it even start?

As someone who finds this interesting but does not know enough to take a position, I think a bigger question would be how does it come up with the abstract concept of energy?

I am aware that Alpha Go Zero came up with various strategic abstractions of the game that are recognized by competent players, and some novel ones, but I do not know where this program and its self-play training stands in the dichotomy of this debate.

Go has a clear objective that you can train for. What would be the objective in coming up with a physics law from raw data? What would it even be training for? That sounds like asking whether DL could create a new board game from a bunch of data on human behavior.

> think a bigger question would be how does it come up with the abstract concept of energy?

I don't see how it could, and that's kind of how Kant argued against empiricism. You can't derive a conceptual understanding of the world from raw data. There's nothing in raw data to structure or make sense of it without some way to interpret the data. Even calling it data is an interpretive act (as opposed to noise).

Given a sufficiently large table, an agent can certainly perceive without acting (testing a hypothesis) that "A+C=B" through its compressibility; the most compact(least complex) representation of the data can replace the many individual datapoints in one column with a learned rule how to calculate it.

There's related research on how children learn language, namely, how much observed evidence (i.e. based on cases where we have monitored and counted every word a child has heard in their life) is needed for a child to switch from a "lookup table" approach for certain features to "rule based" approach (detectable by observing overregularization, applying a systematic rule even when the actual language, including examples the child has heard, has an exception to that rule) and then to a "rule+exceptions" correct understanding; the experiments point towards "learning a rule" then and only then when a "compressed representation" is beneficial from information theory point of view.

It seems to me that A+C=B and E=mc^2 are both written in, or consistent with, that table. But widening the context produces something different in each from a probabilistic perspective. A+C=B is 100% certain with narrow context, but falsified with a little more context, while E=mc^2 is less certain with narrow context, but increasingly certain with wider context.*

*Like, this table is on a computer connected to a global network, which is based on electromagnetism, which is intimately related to relativity. The more you understand what the table is, the more certain you can be of E=mc^2.

Yes, but this is just total energy. Try kinetic energy instead. (Even Newton's Ek=0.5 * mv^2 much less Lorentz special relativity or general relativity.)

Now the model with kinetic energy plus rest energy. A network unaware of time will be unable to figure it out. Especially the differential in velocity.

What you need to actually devise such laws is generalizing conflict-driven clause learning with some good rule to pick models, name them and enumerate them. E.g. defining minimum generalizing set of logic clauses with support for undecidable and uncomputable functions. (Which means deciding when to give up.) This is essentially the inverse of a MAX-SAT solver. Minimax logic representation so to speak.

Well, that hypotesis would be wrong in this particular case (check the middle row).

Which doesn't make your point wrong of course, and this is for a simple function.

Data has to be obtainable and is often dimensional.

How would you have measured the speed of light in the forties? Even having lots of test cases, you wouldn't be able to deduce something like that with test cases alone.

Speed of light has been experimentally determined long before the 40s. The discovery of the speed of light being constant lead to the special relativity (aka before GR which is the E=mcc). http://www.speed-light.info/measure/speed_of_light_history.h...
The speed of light is easy to measure. What's hard is measuring the change in mass when something gains or loses energy. A large nuclear reactor running for a year only loses a few grams of mass-energy -- you can't measure it accurately enough or be sure that the mass didn't go somewhere else like into the cooling water.
Technically with billions of events you can. This is how LHC and other particle accelerator data is analysed.

The interesting part is detecting systematic errors. (Human equivalent is "are my senses lying right now".)

In wikipedia and numerous references, I get the impression that EAs, evolutionary algorithms, are a huge and fuzzily defined field.

With Genetic algorithms, just one kind of evolutionary algorithm, the fitness, the mutation and the crossover function seem to require the implementor to look at the problem domain and "come up with something" whereas once you have a goal, gradient descent requires lots of tuning but is more or less defined, you can track how well you're doing and so-forth.

Perhaps there's something I'm missing. Pointers would be welcome.

Looked at: https://en.wikipedia.org/wiki/Evolutionary_algorithm https://en.wikipedia.org/wiki/Genetic_algorithm

Salimans et al evolved weights in convolutional neural nets -- the same structure of network that works with deep reinforcement learning -- to play Atari games from pixels. https://blog.openai.com/evolution-strategies/

So you can make it work with generic models. Although as you say, many of the big successes like the Backgammon players and Kosa's work on electronic circuits used domain-specific models.