Hacker News
by fogof 1839 days ago
As a PhD student who sort of burned out on this type of research, I agree that the complexity of Neural Networks as a mathematical construct makes them very difficult to analyze. This might also have to do with Deep learning theory being a subset of learning theory which is subject to "No Free Lunch" [1], which means that you always have to be very careful not to try to prove something that turns out to be impossible.

That being said, research on the Kernel regime is one of the very cool ideas, in my opinion, to gain traction in this field in the past few years. To summarize: "If you make a neural network wide enough, it gains the power to control its output on each individual input separately, and will begin to fit its training data perfectly". Of course, the real pleasure is in understanding all the mathematical details of this statement!
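
A rough sketch of the "wide enough to fit the training data perfectly" claim, under heavy simplifying assumptions: the hidden layer is frozen at random values and only the output weights are solved for, which is a random-feature stand-in for the kernel regime rather than the full theory. The cosine features, the width of 200, the `scale` parameter and the labels are all made up for illustration.

```python
import math
import random

def random_features(xs, width, scale=3.0, seed=0):
    """Map each scalar input to `width` fixed (untrained) random cosine features."""
    rng = random.Random(seed)
    ws = [rng.gauss(0.0, scale) for _ in range(width)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(width)]
    norm = math.sqrt(2.0 / width)
    return [[norm * math.cos(w * x + b) for w, b in zip(ws, bs)] for x in xs]

def solve(a, y):
    """Solve a @ z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z

def fit_output_layer(phi, ys):
    """Minimum-norm interpolating output weights: w = phi^T (phi phi^T)^-1 y."""
    gram = [[sum(p * q for p, q in zip(ri, rj)) for rj in phi] for ri in phi]
    alpha = solve(gram, ys)
    width = len(phi[0])
    return [sum(alpha[i] * phi[i][j] for i in range(len(ys))) for j in range(width)]

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [0.3, -1.2, 0.8, 2.0, -0.7]      # arbitrary labels, no pattern intended
phi = random_features(xs, width=200)  # far more features than training points
w = fit_output_layer(phi, ys)
preds = [sum(wj * pj for wj, pj in zip(w, row)) for row in phi]
max_error = max(abs(p - y) for p, y in zip(preds, ys))
print(max_error)                      # training error, at floating-point noise level
```

With many more features than training points, the model has enough freedom to hit every label separately, whatever the labels are. The interesting mathematics is in what the network does between those points, which this sketch says nothing about.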

[1] : https://en.wikipedia.org/wiki/No_free_lunch_theorem

I got my master's years ago so now I'm strictly an amateur. That said, I don't think the "No free lunch theorem" is very "interesting". It's nearly tautological that no approximation method works for "any" function. The set of predictable/interesting/useful/"real-world" functions is going to have measure 0 compared to white noise, so "any function" will basically look like white noise and can't be predicted. Approximating functions/sequences with vanishingly low Kolmogorov complexity is more interesting: impossible in general by Gödel's theorem, but what's the case "on average"? (That depends on the choice process and so is ill-defined, but defining it might be interesting.) The kernel regime stuff looks interesting but I don't know its relation to wide networks.

Neural networks "tend to generalize well in the real world". That's a pretty fuzzy statement imo since "real world" is hardly defined but it's still what people experience and it's more useful to provide a more precise model where this works rather than a model where this doesn't work.

Also, there's good theory on deep networks as universal approximators, as well as theories of wide/shallow networks [1].

[1]: https://arxiv.org/abs/1901.02220

> Neural networks "tend to generalize well in the real world".

I've always interpreted that as "we've found an algorithm that could, given a foreseeable amount of computing power and maybe some tweaks, simulate human decision making".

It isn't so much that neural networks can approximate the real world as they can approximate human perception of the real world.

Well, I quote the statement to show how vague it is, among other things.

Neural networks are "universal approximators" in that they work as well as virtually any previous approximation method. So given a big snapshot of input data and human judgement on it, they can approximate that. They can also approximate a snapshot of some input-output pairs not produced by humans but having patterns (solutions to differential equations, for example).

So, they can approximate what humans do in a given domain. But there's no reason to think they're acting in the same way as humans and I'd say very few people seriously working on ML believe that.

This intuition is very dangerous and leads to huge misconceptions about deep neural nets.

Neural nets don't learn anything like us, and they don't reproduce our functions. We build on massive amounts of general symbolic knowledge, and can zero shot tasks (without explicit examples) easily.

Neural networks really should be seen as just giant random functions that you progressively modify in tiny ways until they fit your data. As parent says, we've just been lucky or good at constraining these functions in a way that they can only learn useful functions (i.e. convnets) or that they somehow learn these more quickly.
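
To make "a random function that you progressively modify in tiny ways" concrete, here is a minimal sketch in plain Python: a one-hidden-layer tanh network starts from random weights and takes small full-batch gradient steps on a toy target. The width `H`, the learning rate, the step count and the target `x * x` are arbitrary choices for illustration, not anything from the thread.

```python
import math
import random

random.seed(0)
H = 8                                    # hidden width (arbitrary)
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [x * x for x in xs]                 # toy target, chosen for illustration

# Start from a random function: f(x) = sum_j v[j] * tanh(w[j] * x + b[j])
w = [random.gauss(0, 1) for _ in range(H)]
b = [random.gauss(0, 1) for _ in range(H)]
v = [random.gauss(0, 1) for _ in range(H)]

def forward(x):
    """Return hidden activations and the network output for one input."""
    h = [math.tanh(w[j] * x + b[j]) for j in range(H)]
    return h, sum(v[j] * h[j] for j in range(H))

def loss():
    """Mean squared error over the training points."""
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

losses = [loss()]
lr = 0.01                                # small step: each update is a tiny change
for step in range(2000):
    gw, gb, gv = [0.0] * H, [0.0] * H, [0.0] * H
    for x, y in zip(xs, ys):
        h, out = forward(x)
        err = 2.0 * (out - y) / len(xs)  # d(loss)/d(out) for this example
        for j in range(H):
            dh = err * v[j] * (1.0 - h[j] ** 2)   # backprop through tanh
            gv[j] += err * h[j]
            gw[j] += dh * x
            gb[j] += dh
    for j in range(H):
        w[j] -= lr * gw[j]
        b[j] -= lr * gb[j]
        v[j] -= lr * gv[j]
    losses.append(loss())

print(losses[0], losses[-1])             # loss before and after training
```

Nothing in the loop knows what the target "means"; it only nudges every weight in the direction that lowers the error on the data.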

Humans certainly do not build on massive amounts of symbolic knowledge, because we are absolutely terrible at symbolic knowledge. Reliably reasoning through a basic logical argument is a specialist skill. Even reviewing evidence before making decisions is uncommon; most humans operate on a look -> assess -> do model where the tricky bit is well approximated by a neural net. Which is why neural nets seem to be so good at real-world tasks.

It is completely plausible that when neural nets get scaled up to something approaching human-brain numbers of connections they will well approximate a human brain or be a few tweaks away. Obviously it won't be knowable until state of the art gets there, but there is no reason to think human intelligence is going to be complicated. It is one evolutionary step up from some pretty basic animals.

Maybe you’re talking about a different kind of symbolic knowledge than the OP. To give an example, humans can instantly tell whether an arbitrary sentence is grammatical or not, which is a deep kind of symbolic reasoning that computers absolutely cannot do right now. And humans can also get the semantic meaning.

Then again math is hard for us. So I think there are nuances.

The fact that computers can't do sentence grammar and meaning right now doesn't tell us anything much about similarities or differences between humans and neural nets. It just tells us that training a neural net purely on a big corpus isn't enough to derive semantic meaning and makes it hard to work out grammatical meaning. No human has ever tried to do that either; everyone comes at text with some real-world experience. So we don't know how well they would do at it. Probably terribly.

It is reasonable to believe that written language is easier to learn for a neural net that is trained on both images and words, so it can form visual links between words. Maybe that takes more computational grunt than we have at the moment. The failure so far proves nothing.

> instantly tell whether an arbitrary sentence is grammatical or not

You do realize we can train a neural network to perform this task? It is a binary classification problem. When I look at a grammatically incorrect sentence I don't do much symbolic reasoning - it just feels "wrong" to me. It does not match any patterns I have in my head for grammatically correct sentences. There's a lot of pattern matching in our thinking process.

What's missing in the current generation of neural networks is efficient information storage and ability to recall that information (e.g. lookup) or update it (direct write).

Neural nets fundamentally cannot operate the same way a brain does, because they cannot create an abstract representation of a problem, and then gradually and deliberately manipulate that mental model until they develop a solution. They just don't work that way, with current structures. They basically apply a single pass of a very complex function to the data, and spit out a result.

That isn't a problem of scale, it's a problem of architecture. This is one of the reasons DeepMind decided to tackle StarCraft. It's very difficult to solve StarCraft without your AI having some ability to develop and then manipulate a mental model of the game, because that's what you need to construct and unfold original, non-linear strategies.

Neural nets generalise because they have to approximate the data at a lower resolution, it's not that they're constrained to only learn what is useful. They're lossy compressors, but they have a unique property that most lossy compressors don't have. They cannot learn all the properties of the input data - partly because they can't hold that much information - but uniquely because neurons cannot be modified in isolation. A change in one neuron changes the influence of every other neuron in that layer, on the next layer. So it's difficult to learn granular properties of specific examples, because the entire net is affected when you do that (and many granular properties that are learned, will be unlearned in subsequent examples). The deeper the net, the less able earlier layers are to extract granular information from the input. They have to extract very abstract information, and they will gradually converge on an abstraction strategy that works.

That's why residual blocks are interesting. They pass that low-level information to later blocks (which have an easier time processing the granular details) while also leveraging the ability of earlier blocks to extract abstract information. It allows you to extract and combine information at multiple levels of granularity (or abstraction).
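
A toy illustration of the skip connection idea, as a sketch only: a "block" here is a single linear layer followed by ReLU, and the residual version adds its input back onto its output. The function names and the all-zero weight matrix are hypothetical, picked to show the extreme case where the learned part contributes nothing.

```python
def linear(weights, x):
    """Matrix-vector product with `weights` given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, xi) for xi in x]

def plain_block(weights, x):
    """Ordinary block: the output depends on the input only through f(x)."""
    return relu(linear(weights, x))

def residual_block(weights, x):
    """Residual block: output = x + f(x), so x itself is carried forward."""
    return [xi + fi for xi, fi in zip(x, plain_block(weights, x))]

x = [0.5, -1.25, 2.0]
zero = [[0.0] * 3 for _ in range(3)]     # a block that has learned nothing

plain_out = plain_block(zero, x)         # [0.0, 0.0, 0.0]: the input is lost
residual_out = residual_block(zero, x)   # [0.5, -1.25, 2.0]: the input survives

# Stacking several such blocks still hands the raw input to the last one.
deep = x
for _ in range(4):
    deep = residual_block(zero, deep)
print(plain_out, residual_out, deep)
```

In a real network the weights are not zero, so later blocks see both the untouched low-level signal and whatever the earlier blocks added on top of it.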

Convnets are also invariant to certain transformations (e.g. translation, and to some degree scale), which I think is a better definition than "can only learn something useful." They're forced to learn information that is more general, which increases the usefulness of each bit, which means you get a higher density of usefulness per FLOP. But you also lose specific information in that process. What if location is meaningful? For example, audio spectrogram analysis can suffer from that property, because specific location on the Y axis is highly meaningful.
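
A small sketch of both halves of that point, assuming a 1-D convolution followed by a global max pool as a stand-in for a convnet. The filter and the two signals are invented: the same bump appears at a low index in one and a high index in the other, as if it were the same shape at two different frequencies in a spectrogram column.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation: slide the kernel over the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def conv_then_global_max(signal, kernel):
    """Convolution plus global max pooling: keeps *what* fired, not *where*."""
    return max(conv1d(signal, kernel))

kernel = [1.0, -2.0, 1.0]                 # hypothetical learned filter
low = [0, 0, 3, 1, 0, 0, 0, 0, 0, 0]      # pattern near the start
high = [0, 0, 0, 0, 0, 0, 3, 1, 0, 0]     # same pattern, shifted by 4

# The feature maps are shifted copies of each other (equivariance)...
map_low, map_high = conv1d(low, kernel), conv1d(high, kernel)
# ...and after pooling the two inputs give the same feature (invariance).
feat_low = conv_then_global_max(low, kernel)
feat_high = conv_then_global_max(high, kernel)
print(feat_low, feat_high)                # 3.0 for both
```

That is efficient when position is irrelevant, and it is exactly the failure mode when the index is a frequency: the pooled feature cannot tell the low bump from the high one.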

What I meant by "forced to learn something useful" is what you put more clearly as being forced to generalize.

NFL theorems aren’t an argument about noise, they’re an argument about the uncountability of real numbers. NFL states that, over all problems, any optimization method performs as poorly as any other, or equivalently, _that if an optimization method does well on some problems, it must do equally poorly on some other problems_, and those others aren’t necessarily noise; they could be anything. The problem is you don’t know in advance which problems it is going to do poorly on. You hope it does poorly on noise or on problems that you don’t care about, but you can’t tell. That is a very different statement from what you’re saying, and it’s as non-trivial as Gödel’s and Turing’s statements on decidability.
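
A tiny finite check of the "equally poorly over all problems" statement, with the caveat that it uses the supervised-learning form of NFL (uniform average over every possible labeling, accuracy measured only on inputs not seen in training) rather than the optimization form. The two learners and the 5/3 train/test split are arbitrary, hypothetical choices.

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))       # all 8 possible 3-bit inputs
train_x, test_x = inputs[:5], inputs[5:]       # arbitrary split

def majority_learner(train):
    """Predict the most common training label for every input."""
    ones = sum(label for _, label in train)
    guess = 1 if 2 * ones > len(train) else 0
    return lambda x: guess

def nearest_neighbour_learner(train):
    """Predict the label of the closest training input (Hamming distance)."""
    def predict(x):
        def distance(pair):
            return sum(a != b for a, b in zip(pair[0], x))
        return min(train, key=distance)[1]
    return predict

def average_off_training_accuracy(learner):
    """Average test accuracy over all 2**8 = 256 possible target functions."""
    targets = list(product([0, 1], repeat=len(inputs)))
    correct = 0
    for labels in targets:
        target = dict(zip(inputs, labels))
        model = learner([(x, target[x]) for x in train_x])
        correct += sum(model(x) == target[x] for x in test_x)
    return correct / (len(targets) * len(test_x))

maj = average_off_training_accuracy(majority_learner)
nn = average_off_training_accuracy(nearest_neighbour_learner)
print(maj, nn)                                 # 0.5 and 0.5
```

Any deterministic learner lands on exactly 0.5 here, because for each training labeling the unseen labels take every combination equally often. Which targets a given learner wins on differs from learner to learner, and those targets need not look like noise.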