by adam_arthur 1182 days ago
Well, humans are just trained on language tokens too (and of course, supplementary images etc).

All the people stating that "real understanding" is significantly different than learning through inference of language are likely going to be proven wrong in the end. There's nothing special about humans that makes our thinking any more sophisticated. With enough examples and the right learning model, systems should be able to be implicitly inferred from language, just as humans infer systems from language.

If we can do it, why can't machines?

The fact that humans pick up language so soon after birth is the motivating question behind the biggest theory in all of linguistics, namely Chomsky's Universal Grammar. The simple fact is, a teacher never stands in front of the class and says "here is how you don't talk. Purple if alligator burp arming has why't." Yet, despite the paucity of negative examples, everyone figures it out. You can't explain that in the current paradigm. There's a lot you can do unreasonably well despite virtually no prior experience. You probably did not need to crash a car for 10k generations before finally making it down the street, nor simulate it in your head. We are missing something fundamental and algorithmic and can only patch over our lack of understanding with volumes of training data for so long.

The idea of "reinforcement learning" is just a rehash of "Hebbian learning". It works for some things, but you can't explain language acquisition as pure reward function and stats.

> Yet, despite the paucity of negative examples, everyone figures it out.

After spending more than a year babbling nonsense and discovering a tiny bit more every time about the meaning of certain combinations of phonemes based on the positive or negative response you get.

> You probably did not need to crash a car for 10k generations before finally making it down the street, nor simulate it in your head.

Are you sure we don't simulate in our head what would happen if we drove the car into the lamp post / brick wall / other car / person, etc.? I find it highly unlikely that this kind of learning does not involve a large amount of simulation.

> There's a lot you can do unreasonably well despite virtually no prior experience.

That's true, but there's a lot we can't do well without repetitive practice, and most things that we can do well in a one-shot fashion depend on having prior practice or familiarity with similar things.

You're digging your heels in on a rehash of a model from the 40s, glibly dismissing the problems it doesn't account for, brought up by linguists in the 50s and 60s, as if they were unaware that babies go through a period of babbling. The amount of time spent acquiring language is already priced in, and it is not enough to account for acquisition as pure reward and training.

>Are you sure we don't simulate in our head what would happen if we drove the car into the lamp post / brick wall / other car / person, etc.?

You left out the 10k times part. You're ignoring the huge training data sizes these models need even for basic inferences. No, I don't think it takes all that much full scale simulation to distill car speed as a function of pedal parameters, and estimate the control problem needed.

In many instances, humans can seemingly extrapolate from far less data. The algorithms to do this are missing. Training with loads of more data isn't a viable long term substitution.

>> Training with loads of more data isn't a viable long term substitution.

Depends. In principle, you can't learn an infinite language from finite examples only and you need both positive and negative ones for super-regular languages. Gold's result and so on. OK so far.

The problem is that in order to get infinite strings from a human language, you need to use its infinite ability for embedding parenthetical sentences: John, the friend of Mary, who married June, who is the daughter of Susan, who went to school with Babe, who ...

But, while this is possible in principle, in practice there's a limit to how long such a sentence can be; or any sentence, really. In practice, most of the utterances generated by humans are going to be not only finite, but relatively short, as in short "relative" to the physical limit of utterance length a human could plausibly produce (which must be something around the length of the Iliad, considering that said human should be able to keep the entire utterance in memory, or lose the thread; and that the Iliad probably went on for as long as one could stand to recite from memory. Or perhaps to listen to someone recite from memory...).

Obviously, there are only a finite number of sentences of finite length, given a fixed vocabulary, so _in practice_ language, as spoken by humans, is not actually-really infinite. Or, let's say that humans really do have a generator of infinite language in our heads, but an outside observer would never see the entire language being produced, because finite universe.

Which means that Chomsky's argument about the poverty of the stimulus might apply to human learning, because it's very clear we learn some kind of complete model of language as we grow up; but, it doesn't need to apply to statistical modelling, i.e. the approximation of language by taking statistics over large text corpora. Given that those large corpora will only have finite utterances, and relatively short ones at that (as I'm supposing above) then it should be possible to at least learn the structure of everyday spoken language, just from text statistics.

So training with lots of data can be a viable long term solution, as long as what's required is to only model the practical parts of language, rather than the entire language. I think we've had plenty of evidence that this should be possible since the 1980's or so.

Now, if someone wanted to get a language model to write like Dostoyevsky...

Your argument is that maybe we can brute force with statistics sentences long enough for no one to notice we run out past a certain point?

Everything you said applies to computers too. Real machines have physical memory constraints.

Sure the set of real sentences may be technically finite, but the growth per word is exponential and you don't have the compute resources to keep up.
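To put rough numbers on that growth, here is a minimal sketch. It counts every possible word string up to a given length, grammatical or not, so it is only an upper bound on the set of real sentences. The 10,000-word vocabulary is an illustrative assumption, not a figure from this thread.

```python
def count_sequences(vocabulary_size: int, max_length: int) -> int:
    """Count all word strings of length 1..max_length over a fixed vocabulary.

    This is an upper bound: most of these strings are not valid sentences.
    """
    return sum(vocabulary_size ** length for length in range(1, max_length + 1))


if __name__ == "__main__":
    # Hypothetical vocabulary size, chosen only for illustration.
    vocabulary_size = 10_000
    for max_length in (1, 5, 10, 20):
        total = count_sequences(vocabulary_size, max_length)
        print(f"up to {max_length} words: {total:.3e} possible strings")
```

The count is finite for any fixed length, but each extra word multiplies it by the vocabulary size.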

Information is not about what is said but about what could be said. It doesn't matter so much that not every valid permutation of words is uttered, but rather that for any set of circumstances there exist words to describe it. Each new word in the string carries information in the sense that it reduces the set of possibilities from before I relayed my message. A machine which picks the maximum likelihood message in all circumstances is by definition not conveying information. It's spewing entropy.
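One way to read that claim, assuming Shannon's definitions: a message the receiver could predict with certainty carries zero bits, while a choice among equally likely alternatives carries the most. The sketch below just computes those quantities. The function names are made up for illustration, and the simplifying assumption is that the receiver already knows the distribution.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Information content, in bits, of an outcome with the given probability."""
    return math.log2(1.0 / probability)


def entropy_bits(distribution) -> float:
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return sum(p * surprisal_bits(p) for p in distribution if p > 0)


if __name__ == "__main__":
    print("always the same message:", entropy_bits([1.0]))
    print("fair coin:", entropy_bits([0.5, 0.5]))
    print("four equally likely messages:", entropy_bits([0.25] * 4))
```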

Now, now. Who said anything about information? I was just talking about modelling text. Like, the distribution of token collocations in a corpus of natural language. We know that's perfectly doable, it's been done for years. And to avoid exponential blowups, just use the Markov property or in any case, do some fudgy approximation of this and that and you're good to go.
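For anyone who hasn't seen it, a minimal sketch of what "use the Markov property" means in the simplest case: a bigram model, where the next token depends only on the current one. The toy corpus and function names are invented for illustration, and this is nothing like how the GPT models are built.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count how often each token is followed by each other token."""
    transitions = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        transitions[current][following] += 1
    return transitions


def generate(transitions, start, length, seed=0):
    """Sample up to `length` tokens; each step looks only at the previous token."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        followers = transitions.get(output[-1])
        if not followers:
            break  # the last token was never seen with a successor
        words, counts = zip(*followers.items())
        output.append(rng.choices(words, weights=counts, k=1)[0])
    return output


if __name__ == "__main__":
    # Toy corpus, purely illustrative.
    corpus = "the cat sat on the mat and the dog sat on the rug".split()
    model = train_bigram_model(corpus)
    print(" ".join(generate(model, "the", 8)))
```

The table of counts grows with the number of observed token pairs, not with the number of possible sentences, which is how the exponential blowup is avoided.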

>> Your argument is that maybe we can brute force with statistics sentences long enough for no one to notice we run out past a certain point?

No, I wasn't saying that, I was saying that we only need to model sentences that are short enough that nobody will notice that the plot is lost with longer ones.

To clarify, because it's late and I'm tired and probably not making a lot of sense and bothering you, I'm saying that statistics can capture some surface regularities of natural language, but not all of natural language, mainly because there's no way to display the entirety of natural language for its statistics to be captured.

Oh god, that's an even worse mess. I mean: statistics can only get you so far. But that might be good enough depending on what you're trying to do. I think that's what we're seeing with those GPT things.

It’s not a question of whether machines can do it at all. The question is whether our current approach of training LLMs can do it. We don’t know how the human brain works, so we have no idea if there’s something in the brain that is fundamentally different from training an LLM.

Obviously machines can theoretically do what a brain can do because a machine can theoretically simulate a brain. But then it’s not an LLM anymore.

It's a neural network at the end of the day... it can compute any result or "understand" any system if properly weighted and structured.

It may be that LLM style training techniques are not sufficient to "understand" systems, or it may be that at a certain scale of input data, and some fine tuning, it is sufficient to be indistinguishable from other training methods.

Many people's sense of what qualifies as "intelligence" is too grandiose/misplaced. The main thing differentiating us from a neural network is that we have wants and desires, and the ability to prompt and conduct our own training as a result of those.

An LLM isn’t going to learn how to drive a car, because of how LLMs are trained, even if a neural network could.

It isn’t that people’s views on intelligence are grandiose, it’s that the specific approach used has massive inherent limitations. ChatGPT 4 is still relatively bad at chess, 1 win, 1 draw, 1 loss vs a 1400 isn’t impressive objectively and looks much worse when considering the amount of processing power they are using. The only impressive thing about this is how general their approach is, but in a wider context it’s still quite limited.

IMO the next jump of being able to toss 100x as much processing power at the problem will see LLMs tossed aside for even more general approaches like, say, using YouTube videos.

> It's a neural network at the end of the day... it can compute any result or "understand" any system if properly weighted and structured.

That's not even remotely close to being demonstrated.

For one thing, neural networks can only approximate continuous functions.

For another, the fact that in principle there exists a neural network that can approximate any continuous function to arbitrary precision doesn't in any way tell us that there is a way to "train" that network by any known algorithm. There isn't even reason to believe that such an algorithm exists for the general case, at least not one with a finite number of examples.
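To make the "exists" half of that concrete: for a continuous function on an interval you can write the weights of a one-hidden-layer ReLU network down by hand, as a piecewise-linear interpolant, with no training at all. This is only a sketch of that construction in one dimension. The target function, interval, and unit counts are arbitrary choices for illustration, and it says nothing about whether gradient descent would find such weights.

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


def build_network(f, a: float, b: float, hidden_units: int):
    """Hand-construct (not train) a one-hidden-layer ReLU net interpolating f on [a, b]."""
    step = (b - a) / hidden_units
    knots = [a + i * step for i in range(hidden_units + 1)]
    values = [f(x) for x in knots]
    slopes = [(values[i + 1] - values[i]) / step for i in range(hidden_units)]
    # Each hidden unit switches on at its knot and adds the change in slope there.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, hidden_units)]
    biases = knots[:-1]
    return values[0], biases, weights


def evaluate(network, x: float) -> float:
    offset, biases, weights = network
    return offset + sum(w * relu(x - b) for w, b in zip(weights, biases))


def max_error(f, network, a: float, b: float, samples: int = 1000) -> float:
    """Largest absolute error on an evenly spaced grid over [a, b]."""
    grid = (a + (b - a) * k / samples for k in range(samples + 1))
    return max(abs(f(x) - evaluate(network, x)) for x in grid)


if __name__ == "__main__":
    for units in (4, 16, 64):
        net = build_network(math.sin, 0.0, math.pi, units)
        print(units, "hidden units, max error:", max_error(math.sin, net, 0.0, math.pi))
```

The error shrinks as hidden units are added, but only because the weights were computed directly from the target function, which is exactly what a learner with finitely many examples doesn't get to do.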

Approximating continuous functions is likely quite the same as what people do too. You think there isn’t some mathematical model under the hood of how the brain works too? That it doesn’t break down into functions with interpretable results? Is it spiritual or mystical in your mind?

These takes are so bad and pervasive on here, honestly. This is what I mean by grandiose thinking.

A machine that approximates functions, that otherwise is indistinguishable from a human, is effectively intelligent like a human. Incentives, wants, desires, and the ability to conduct our own training are the only differences at that point.

>> Is it spiritual or mystical in your mind?

No, they're just saying there are continuous functions, and then there are discrete functions, and neural nets can't approximate discrete functions, while humans certainly can (e.g. integer addition). And that even when it comes to approximating any continuous function, neural nets can do that in principle, but we don't know how to do it in practice, just like we know time travel, stable wormholes and the Alcubierre drive are feasible in principle, but we can't realise them in practice.

So please don't say it's "spiritual and mystical" in the other person's mind just because it's not very clear in yours.

Also, what the OP didn't say is that a Transformer architecture is not the kind of architecture used to show the universality of neural nets. That was shown for a multi-layer perceptron (MLP) with one hidden layer, not a deep neural net like a Transformer, and certainly not a network with attention heads. If you wanted to be all theoretical about it and claim that because there's that old proof, someone will eventually find out how to do it in practice, then the Transformer architecture has already taken a wrong turn and is moving away from the target.

There aren't any universality results for Transformers. I mean, that would be the day! The reason that proof was derived for an MLP with one hidden layer is that this makes the proof much, much easier than if you wanted to show the same for another architecture.

I can ask an LLM what 2+2 is and it can answer with 4. That's a discrete result. So how is this different from human thinking? Where is your evidence that this is not a similar mechanism?

It gets some math wrong because it doesn't understand the "systemic" aspect of math, but who's to say that with minor training tweaks, or a larger dataset, it wouldn't be able to infer the system? Humans infer systems from language all the time. To say you need some specialized form of training beyond language inference is obviously wrong when you view how humans train, learn and understand. All of life is ingestion of information via language which produces systemic understanding.

I can play digital audio that's indistinguishable from acoustic, despite it not being a smooth function in practice. Similarly, a sufficiently advanced neural net can produce intellect-like results, even if there are aspects of the structure you say may not make it so.

Honestly, the perception you and many others seem to hold is that something being mathematically explainable in such a way that you can "trivialize" its operation makes it not intelligence. But you hold "intelligence" in too high a regard.

> Approximating continuous functions is likely quite the same as what people do too.

In a very broad sense, if you just mean "the human brain also just approximates some class of functions", sure. However, human brains can surely represent many classes of non-continuous functions as well (tan, lots of piecewise functions, etc). And, crucially, some of these are necessary for our physical models of the world. So, if neural networks are limited to only representing continuous functions, that is a strong indication that they are fundamentally unable to mimic the human mind.

> You think there isn’t some mathematical model under the hood of how the brain works too?

Of course there is. I do believe that the mind is simply a program running on the physical computer that is our brain. And I am sure that some day we will be able to create an AI that is human-like, and probably much better at it, running on silicon.

That doesn't mean that we should believe every program running on silicon, despite somewhat obvious fundamental limitations, is going to be the next AGI any day now. That's all I'm trying to point out: neural networks are not a great model for AGI, and backpropagation/gradient descent as a training algorithm even less so.

The current models are still too simple and are missing things we have not figured out yet; when a human (or other animal) learns, it only needs a tiny corpus (compared to the corpus of text etc. that GPT gets trained on) to become a smart human. So the model needs something we have built in that makes learning vastly more efficient. Then there will be another big jump. That’ll come, that or a new AI winter.

> Well, humans are just trained on language tokens too

This is not true at all

You can teach someone chess without language

And how would you do that? Show certain moves and point your thumb up for ok, down for not ok? Then sorry, you're still using a language, just without using words.

It doesn't seem so far-fetched to believe somebody could learn chess just by watching others play it, no language needed at all (except perhaps reading the body language of being glad to win). But I imagine LLMs will soon have the ability to turn image sequences into information that can be interpreted and ingested much the same way as they can text, and thus "learn" how to play chess just from analyzing videos of actual games being played.

In legend, Paul Morphy learned in this way.