by joe_the_user 1839 days ago
I got my master's years ago, so now I'm strictly an amateur. That said, I don't think the "No free lunch theorem" is very "interesting". It's nearly tautological that no approximation method works for "any" function. The set of predictable/interesting/useful/"real-world" functions is going to have measure 0 compared to white noise, so "any function" will basically look like white noise and can't be predicted. Approximating functions/sequences with vanishingly low Kolmogorov complexity is more interesting. It's impossible in general by Gödel's theorem, but what's the case "on average"? (That depends on the choice process and so is ill-defined, but defining it might be interesting.) The kernel regime stuff looks interesting but I don't know its relation to wide networks.
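
To make the "structured vs. white noise" contrast concrete, here is a rough sketch that uses zlib's compressed size as a crude stand-in for Kolmogorov complexity. That substitution is my assumption, not something claimed above: compression only gives an upper bound, and the sine sequence, the length and the seed are arbitrary illustrative choices.

```python
import math
import random
import zlib

def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more structure was found)."""
    return len(zlib.compress(data, 9)) / len(data)

n = 10_000
rng = random.Random(0)  # fixed seed so the run is repeatable

# "White noise": independent, uniformly distributed bytes.
noise = bytes(rng.randrange(256) for _ in range(n))

# A "structured" sequence: a sine wave quantised to one byte per sample.
structured = bytes(
    int(127.5 + 127.5 * math.sin(2 * math.pi * i / 50)) for i in range(n)
)

noise_ratio = compressed_ratio(noise)
structured_ratio = compressed_ratio(structured)

print(f"noise:      {noise_ratio:.3f}")
print(f"structured: {structured_ratio:.3f}")
```

One caveat worth keeping in mind: the "noise" here comes from a seeded pseudo-random generator, so strictly speaking it has a short description too. zlib just cannot find it, which is a small reminder of why the true complexity is not computable in general.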

Neural networks "tend to generalize well in the real world". That's a pretty fuzzy statement imo since "real world" is hardly defined but it's still what people experience and it's more useful to provide a more precise model where this works rather than a model where this doesn't work.

Also, there's good theory on deep networks as universal approximators, as well as theories of wide/shallow networks [1].

[1]: https://arxiv.org/abs/1901.02220


> Neural networks "tend to generalize well in the real world".

I've always interpreted that as "we've found an algorithm that could, given a foreseeable amount of computing power and maybe some tweaks, simulate human decision making".

It isn't so much that neural networks can approximate the real world as they can approximate human perception of the real world.

Well, I quote the statement to show how vague it is, among other things.

Neural networks are "universal approximators" in that they work as well as virtually any previous approximation method. So given a big snapshot of input data and human judgement on it, they can approximate that. They can also approximate a snapshot of input-output pairs that were not produced by humans but still have patterns (solutions to differential equations, for example).

So, they can approximate what humans do in a given domain. But there's no reason to think they're acting in the same way as humans and I'd say very few people seriously working on ML believe that.

This intuition is very dangerous and leads to huge misconceptions about deep neural nets.

Neural nets don't learn anything like us, and they don't reproduce our functions. We build on massive amounts of general symbolic knowledge, and can zero shot tasks (without explicit examples) easily.

Neural networks really should be seen as just giant random functions that you progressively modify in tiny ways until they fit your data. As the parent says, we've just been lucky or good at constraining these functions so that they can only learn useful functions (i.e. convnets), or so that they somehow learn these more quickly.
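
A minimal sketch of the "random function, nudged in tiny steps" picture, in plain Python so it runs anywhere. Everything specific here is an assumption chosen for illustration: one hidden layer of 8 tanh units, a sine target, full-batch gradient descent and the learning rate.

```python
import math
import random

rng = random.Random(0)
H = 8  # number of hidden units (arbitrary)

# Random initial parameters: before training this is just a random function.
w1 = [rng.gauss(0, 1) for _ in range(H)]
b1 = [rng.gauss(0, 1) for _ in range(H)]
w2 = [rng.gauss(0, 1) / math.sqrt(H) for _ in range(H)]
b2 = 0.0

# The data to fit: 20 samples of sin(x) on [-3, 3].
xs = [-3 + 6 * i / 19 for i in range(20)]
ys = [math.sin(x) for x in xs]

def predict(x):
    return sum(w2[j] * math.tanh(w1[j] * x + b1[j]) for j in range(H)) + b2

def mse():
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

initial_loss = mse()
lr = 0.01  # small step size: each update changes the function only slightly

for step in range(2000):
    gw1, gb1, gw2 = [0.0] * H, [0.0] * H, [0.0] * H
    gb2 = 0.0
    for x, y in zip(xs, ys):
        h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
        err = sum(w2[j] * h[j] for j in range(H)) + b2 - y
        g = 2 * err / len(xs)  # derivative of the mean squared error w.r.t. the output
        for j in range(H):
            gw2[j] += g * h[j]
            dpre = g * w2[j] * (1 - h[j] ** 2)  # backprop through tanh
            gw1[j] += dpre * x
            gb1[j] += dpre
        gb2 += g
    for j in range(H):
        w1[j] -= lr * gw1[j]
        b1[j] -= lr * gb1[j]
        w2[j] -= lr * gw2[j]
    b2 -= lr * gb2

final_loss = mse()
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Nothing in the loop knows anything about sine waves; the only thing steering the random function is the data.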

Humans certainly do not build on massive amounts of symbolic knowledge because we are absolutely terrible at symbolic knowledge. Reliably reasoning through a basic logical argument is a specialist skill. Even reviewing evidence before making decisions is uncommon, most humans operate on a look -> assess -> do model where the tricky bit is well approximated by a neural net. Which is why neural nets seem to be so good at real-world tasks.

It is completely plausible that when neural nets get scaled up to something approaching human-brain numbers of connections they will well approximate a human brain or be a few tweaks away. Obviously it won't be knowable until state of the art gets there, but there is no reason to think human intelligence is going to be complicated. It is one evolutionary step up from some pretty basic animals.

Maybe you’re talking about a different kind of symbolic knowledge than the OP. To give an example, humans can instantly tell whether an arbitrary sentence is grammatical or not, which is a deep kind of symbolic reasoning that computers absolutely cannot do right now. And humans can also get the semantic meaning.

Then again math is hard for us. So I think there are nuances.

The fact that computers can't do sentence grammar and meaning right now doesn't tell us anything much about similarities or differences between humans and neural nets. It just tells us that training a neural net purely on a big corpus isn't enough to derive semantic meaning and makes it hard to work out grammatical meaning. No human has ever tried to do that either, everyone comes at text with some real-world experience. So we don't know how well they would do at it. Probably terribly.

It is reasonable to believe that written language is easier to train on a neural net that is trained on both images and words so it can form visual links between words. Maybe that takes more computational grunt than we have at the moment. The failure so far proves nothing.

The argument was about whether humans and neural nets learn in a similar way. I don't see how what you are saying has any impact on that.

> instantly tell whether an arbitrary sentence is grammatical or not

You do realize we can train a neural network to perform this task? It is a binary classification problem. When I look at a grammatically incorrect sentence I don't do much symbolic reasoning - it just feels "wrong" to me. It does not match any patterns I have in my head for grammatically correct sentences. There's a lot of pattern matching in our thinking process.

What's missing in the current generation of neural networks is efficient information storage and ability to recall that information (e.g. lookup) or update it (direct write).

"You do realize we can train a neural network to perform this task"

I'm doing a master's in deep learning for NLP and I'm not sure we can. Language modelling can't do this because grammatical yet semantically implausible combinations of words are assigned very low probability (that is, very high perplexity), the classic example being Noam Chomsky's "Colorless green ideas sleep furiously".
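
To illustrate how a language model's score tracks "have I seen this before" rather than grammaticality, here is a toy sketch: a bigram model with add-one smoothing over a made-up seven-sentence corpus. The corpus, the smoothing choice and the scrambled test sentence are all assumptions for illustration, nothing like a real language model.

```python
import math
from collections import Counter

# Made-up toy corpus; every word used in the test sentences appears somewhere here.
corpus = [
    "the dog sleeps quietly",
    "the cat sleeps quietly",
    "green leaves fall quietly",
    "new ideas spread quickly",
    "dogs sleep quietly",
    "colorless gas spreads quickly",
    "people argue furiously",
]

def tokens(sentence):
    return ["<s>"] + sentence.split() + ["</s>"]

context_counts = Counter()  # how often each word appears as the left side of a bigram
bigram_counts = Counter()
for sentence in corpus:
    t = tokens(sentence)
    context_counts.update(t[:-1])
    bigram_counts.update(zip(t, t[1:]))

vocab_size = len({w for s in corpus for w in tokens(s)})

def perplexity(sentence):
    """Per-token perplexity under the add-one-smoothed bigram model."""
    t = tokens(sentence)
    log_prob = 0.0
    for a, b in zip(t, t[1:]):
        p = (bigram_counts[(a, b)] + 1) / (context_counts[a] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(t) - 1))

seen = perplexity("the dog sleeps quietly")                      # taken from the corpus
chomsky = perplexity("colorless green ideas sleep furiously")    # grammatical, never seen
scrambled = perplexity("furiously sleep ideas green colorless")  # ungrammatical

print(f"seen:      {seen:.1f}")
print(f"chomsky:   {chomsky:.1f}")
print(f"scrambled: {scrambled:.1f}")
```

In this toy, both the grammatical-but-novel sentence and the scrambled one score worse than the sentence taken from the corpus, so a threshold on perplexity alone would reject a perfectly grammatical sentence.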

What would be a training set for this? I assume we would first try to do parsing to extract the grammatical role of each word. Then what would be the dataset? A massive attempt at generating the set of all possible trees that are grammatical?

I guess we could use massive textual datasets from reputable sources, extract their grammatical role trees, and learn from that. Generating negative examples with sufficient coverage would be very hard. Strict generative modelling without negative examples with good coverage would run into the same problem as language modelling, where acceptable but unlikely examples would get a low probability (high perplexity) despite being good.

It would seem to me that in order to generate negative examples with good coverage, you would need a man-made program with a definition of what grammaticality means, which would make training a neural network pointless to begin with.

Seems like the experts agree with my take: https://linguistics.stackexchange.com/a/1108

If we can train a computer to classify sentences as grammatical or not please let me know where. You’ll save the linguistics department a lot of money as they’ll no longer have to contact native speakers for this research.
Humans require fewer examples to learn language rules. It's not clear that humans use the same learning model as a "deep net."
Neural nets fundamentally cannot operate the same way a brain does, because they cannot create an abstract representation of a problem, and then gradually and deliberately manipulate that mental model until they develop a solution. They just don't work that way, with current structures. They basically apply a single pass of a very complex function to the data, and spit out a result.

That isn't a problem of scale, it's a problem of architecture. This is one of the reasons DeepMind decided to tackle StarCraft. It's very difficult to solve StarCraft without your AI having some ability to develop and then manipulate a mental model of the game, because that's what you need to construct and unfold original, non-linear strategies.

Neural nets generalise because they have to approximate the data at a lower resolution, it's not that they're constrained to only learn what is useful. They're lossy compressors, but they have a unique property that most lossy compressors don't have. They cannot learn all the properties of the input data - partly because they can't hold that much information - but uniquely because neurons cannot be modified in isolation. A change in one neuron changes the influence of every other neuron in that layer, on the next layer. So it's difficult to learn granular properties of specific examples, because the entire net is affected when you do that (and many granular properties that are learned, will be unlearned in subsequent examples). The deeper the net, the less able earlier layers are to extract granular information from the input. They have to extract very abstract information, and they will gradually converge on an abstraction strategy that works.

That's why residual blocks are interesting. They pass that low-level information to later blocks (which have an easier time processing the granular details) while also leveraging the ability of earlier blocks to extract abstract information. It allows you to extract and combine information at multiple levels of granularity (or abstraction).
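
A stripped-down sketch of the skip connection, assuming a single dense layer with ReLU as the block body (real residual blocks typically use convolutions and normalisation, which are left out here). With all-zero weights the plain block wipes out the input, while the residual block hands it through unchanged to whatever comes next.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def dense(x, weights, bias):
    """One fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def plain_block(x, weights, bias):
    return relu(dense(x, weights, bias))

def residual_block(x, weights, bias):
    # Output is the input plus whatever the block computes: y = x + f(x).
    fx = plain_block(x, weights, bias)
    return [xi + fi for xi, fi in zip(x, fx)]

x = [0.5, -1.25, 2.0]  # stands in for low-level detail in the input

# Hypothetical worst case: a block whose weights carry no information at all.
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

plain_out = plain_block(x, zero_w, zero_b)
residual_out = residual_block(x, zero_w, zero_b)

print("plain:   ", plain_out)
print("residual:", residual_out)
```

The block only has to learn a correction on top of the identity, so the granular detail in `x` is still available to later blocks.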

Convnets are also invariant to certain transformations (e.g. translation, and to some degree scale), which I think is a better description than "can only learn something useful." They're forced to learn information that is more general, which increases the usefulness of each bit, which means you get a higher density of usefulness per FLOP. But you also lose specific information in that process. What if location is meaningful? For example, audio spectrogram analysis can suffer from that property, because specific location on the Y axis is highly meaningful.
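
A small 1-D sketch of that trade-off, with a hand-picked kernel and signal (both made up for illustration). The convolution output moves along with the input, and a global max pool on top then gives the same answer wherever the pattern sits, which is exactly the position information a spectrogram model might need.

```python
def conv1d(signal, kernel):
    """'Valid' cross-correlation, as used in convnets (no kernel flip, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def shift_right(values, steps):
    """Shift a list right by `steps`, filling with zeros and keeping the length."""
    return [0.0] * steps + values[: len(values) - steps]

kernel = [1.0, 3.0, 1.0]
signal = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
moved = shift_right(signal, 3)  # same pattern, three samples later

features = conv1d(signal, kernel)
features_moved = conv1d(moved, kernel)

# The feature map shifts with the input ...
same_up_to_shift = features_moved == shift_right(features, 3)

# ... and global max pooling then throws the position away.
pooled = max(features)
pooled_moved = max(features_moved)
peak_position = features.index(pooled)
peak_position_moved = features_moved.index(pooled_moved)

print(same_up_to_shift, pooled, pooled_moved, peak_position, peak_position_moved)
```

After pooling, the two inputs are indistinguishable even though the peak sat at different positions in the feature maps.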

What I meant by "forced to learn something useful" is what you put more clearly as being forced to generalize.
NFL theorems aren’t an argument about noise; they’re an argument about the uncountability of real numbers. NFL states that, averaged over all problems, every optimization method performs as poorly as any other, or equivalently, _that if an optimization method does well on some problems, it must do correspondingly poorly on some other problems_, and those others aren’t necessarily noise; they could be anything. The problem is that you don’t know in advance which problems it is going to do poorly on. You hope it does poorly on noise or on problems that you don’t care about, but you can’t tell. That is a very different statement from what you’re saying, and it’s as non-trivial as Gödel’s and Turing’s statements on decidability.
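
A finite toy version of the averaging statement can be checked by brute force. This is only a sketch of the simplest supervised-learning flavour (every boolean function on 3-bit inputs counted equally, accuracy measured on the points not seen in training); the actual theorems are more general. The two learners are arbitrary examples I picked.

```python
from itertools import product

inputs = list(product([0, 1], repeat=3))    # all 8 possible 3-bit inputs
train_x, test_x = inputs[:5], inputs[5:]    # arbitrary split: 5 seen, 3 unseen

def majority_learner(train):
    """Always predicts the most common training label (ties go to 1)."""
    guess = 1 if 2 * sum(train.values()) >= len(train) else 0
    return lambda x: guess

def nearest_neighbour_learner(train):
    """Predicts the label of the closest training point in Hamming distance."""
    def predict(x):
        nearest = min(train, key=lambda t: sum(a != b for a, b in zip(t, x)))
        return train[nearest]
    return predict

def off_training_hits(learner):
    """Total correct predictions on unseen points, summed over every target function."""
    hits = 0
    for labels in product([0, 1], repeat=len(inputs)):  # all 256 target functions
        target = dict(zip(inputs, labels))
        model = learner({x: target[x] for x in train_x})
        hits += sum(model(x) == target[x] for x in test_x)
    return hits

total_predictions = 2 ** len(inputs) * len(test_x)
majority_hits = off_training_hits(majority_learner)
nn_hits = off_training_hits(nearest_neighbour_learner)

print(majority_hits, nn_hits, total_predictions)
```

Both learners land on exactly half of the unseen predictions, and so would any other deterministic learner: for each fixed set of training labels, every completion of the unseen labels is counted once, so whatever the learner guesses is right on exactly half of them. Note that the functions it does badly on here are not "noise" in any special sense; they are simply the other completions.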