by fenomas 1165 days ago
I don't think this argument holds water at all. Can we imagine that the people who created Rust were able to do so only because they contributed some magical ingredient from their souls, which didn't exist anywhere in the world's collected pre-Rust printed materials? It's more economical to assume that they were able to create Rust because the necessary concepts already existed, in which case a sufficiently trained AI might do something similar.

Or working in the opposite direction: we can think of AIs as processing concepts in some dimensional space, sure. But we have no conception at all of what that space is like, so there's no reason to expect that a midpoint in that space between two objects we're familiar with would also be familiar to us. I mean, I have no idea what the midpoint between Rust and C++ is, or how I'd go about describing it. Surely an AI that thinks in tensors is more capable than we are to explore the space between known concepts, so why couldn't we expect to learn something novel from one?

Concepts are developed by animals over time. A baby develops sensory-motor concepts from day-1; a child abstracts them; a teenager communicates them; an adult refines that communication.

They are not developed as a matter of averaging over all the text on the internet.

Concepts do not pre-exist concepts.

Respectfully, that sounds like hand-waving. Claiming to know where concepts do and don't come from just leads to questions like "did the natural numbers exist before we did?", which are centuries old and presumably not resolvable.

Whereas a more focused question like "can an AI produce outputs that are novel to someone familiar with all of the AI's inputs?" seems resolvable, and even if one thinks it's unlikely or not easy, it's very hard to buy the idea that it's impossible.

> just leads to questions

No, not really. People in this area are very poorly informed on animal learning, and "ordinary science".

AI evangelists like to treat as "merely philosophical matters" profoundly scientific ones.

The issues here belong to ordinary science. Can a machine with access only to statistical patterns in the distribution of text tokens infer the physical structure of reality?

We can say, as certain as anything: No.

Associative statistical models are not phenomenological models (ie., specialised to observable cause-effect measures); and phenomenological models are not causal (ie., do not give the mechanism of the cause-effect relationship).

Further, we know as surely as an athlete catching a ball, that animals develop causal models of their environments "deeply and spontaneously".

And we know, to quite a robust degree, how they do so -- they use interior causal models of their bodies to change their environments by intentional acts, which can confirm or disconfirm environmental models. This is modelled logically as abduction, causally as sensory-motor adaptation, and so on.

This is not a philosophical matter. We know that "statistical learning" which is nothing more than a "correlation maximisation objective" over non-phenomenological, non-causal, non-physical data produces approximate associative models of those target domains -- that have little use beyond "replaying those associations".

ChatGPT appears to do many things. But you will see soon, after a year or two of papers published, that those things were tricks. That "replaying associations in everything ever written" is a great trick, that is very useful to people.

Today you can ask ChatGPT to rewrite Harry Potter "if Harry were evil" or some such thing. That's because there are many libraries of books on Harry Potter and "evil" -- and by statistical interpolation alone, you can answer an apparent counter-factual question which should require imagination.

But give ChatGPT an actual counter-factual whose parts are only in the question, and you'll be out-of-luck.

Eg., tell it about tables, chairs, pens, cups and ask it to arrange them using given operations so that, eg., the room is orderly. Or whatever you wish.

Specified precisely enough you can expose the trick.

>This is not a philosophical matter. We know that "statistical learning" which is nothing more than a "correlation maximisation objective" over non-phenomenological, non-causal, non-physical data produces approximate associative models of those target domains -- that have little use beyond "replaying those associations".

Why do you think the data LLMs are trained on are non-causal? Let's take causation as asymmetric correlation. That is, (A,B) present in the training data does not imply (B,A) presence. But of course human text is asymmetric in this manner and LLMs will pick up on this asymmetry. You might say that causation isn't merely about asymmetric correlation, but that of the former determining the latter. But this isn't something we observe from nature, it is an explanatory posit that humans have landed on in service to modelling the world. So causation is intrinsically explanatory, and explanation is intrinsically causal. The question is, does an LLM in the course of modelling asymmetric correlations, develop something analogous to an explanatory model. I think so, in the sense that a good statistical model will intrinsically capture explanatory relations.

Cashing out explanation and explanatory model isn't easy. But as a first pass I can say that explanatory models capture intrinsic regularity of a target system such that the model has an analogical relationship with internal mechanisms in the target system. This means that certain transformations applied to the target system have a corresponding transformation in the model that identifies the same outcome. If we view phenomena in terms of mechanistic levels with the extrinsic observable properties as the top level and the internal mechanisms as lower levels, an explanatory model will model some lower mechanistic level and recover properties of the top level.

But this is in the solution space of good models of statistical regularity of an external system. To maximally predict the next token in a sequence just requires a model of the process that generates that sequence.

Well this is a good reply, but it's mistaken.

The conditional probability:

    P(x[0]| x[-1], x[-2], x[-3] ...)
is not the same as,

    P(x[0] | x[-1], x[-2], ... -> x[0])
Where `->` says we select only those cases where x[-1],... brought-about x[0].

To see why this is the case, suppose we do have a god's eye-view of all of spacetime.

    P(A|B)
always selects for all instances where A follows B.

    P(A| B -> A)
selects only those instances where A's following B was caused by B.
Eg.,

    P(ShoesWet | Raining)

is very different from

    P(ShoesWet | Raining -> ShoesWet)

In the former case the two events have, in general, nothing to do with each other.

To select "Raining -> ShoesWet" even with a god's-eye view we need more than statistics... since those events which count as "Raining -> ShoesWet" have to be selected on a non-statistical basis.

For the athlete catching a ball, or the scientist designing the experiment, we're interested only in those causal cases.

For sure P(A|B) is an (approximate, statistical) model of P(A| B->A) -- but it's a very restricted, limited model.

The athlete needs to estimate P(ball-stops | catch -> ball-stops)

NOT P(ball-stops | catch) which is just any case of the ball-stopping given any case of catching.
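One way to make the gap between the two quantities concrete is to swap in Pearl's do-operator for the `->` notation above. That substitution is an assumption of this sketch, not the commenter's own formalism. The toy world below is entirely hypothetical: a hidden `wet_season` variable makes both rain and leftover puddles more likely, and shoes get wet either from rain (if the wearer is exposed) or from puddles. All probabilities are made-up illustrative numbers. Conditioning on rain and forcing rain then give different answers.

```python
from fractions import Fraction
from itertools import product

def bern(p, value):
    """Probability that a Bernoulli(p) variable takes `value`."""
    return p if value else 1 - p

def worlds(do_raining=None):
    """Yield (probability, raining, shoes_wet) for every joint setting.

    If do_raining is not None, rain is forced to that value regardless of
    the season (an intervention) instead of following its usual distribution.
    """
    for wet_season, raining, puddles, exposed in product([False, True], repeat=4):
        p = bern(Fraction(1, 2), wet_season)
        if do_raining is None:
            p *= bern(Fraction(4, 5) if wet_season else Fraction(1, 5), raining)
        elif raining != do_raining:
            continue
        p *= bern(Fraction(3, 5) if wet_season else Fraction(1, 10), puddles)
        p *= bern(Fraction(1, 2), exposed)
        shoes_wet = (raining and exposed) or puddles
        yield p, raining, shoes_wet

def p_wet_given_rain(do_raining=None):
    """P(shoes_wet | raining), either observed or under an intervention."""
    num = den = Fraction(0)
    for p, raining, shoes_wet in worlds(do_raining):
        if raining:
            den += p
            if shoes_wet:
                num += p
    return num / den

observational = p_wet_given_rain()                  # condition on seeing rain
interventional = p_wet_given_rain(do_raining=True)  # force rain to happen
print(observational, interventional)
```

In this toy world the observed conditional overstates what rain itself does, because seeing rain is also evidence of the wet season and hence of puddles.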

Let me alter your example a bit: we have P(A|B), we want P(A|B,B->A). But given enough examples of the form P(A|B), a good algorithm can deduce B->A and use it going forward to predict A. How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases. LLMs do this with self-attention, by taking every pair of symbols in the context window and testing whether each pair is useful in determining the next token. As the attention matrix converges, the model can leverage the presence of "Raining & Outside" in predicting "ShoesWet".

Of course, this is a rather poor excuse for an explanation. The fact that "outside" and "raining" are close doesn't explain why "my shoes are wet". But it does get us closer to a genuine explanation in the sense that it eliminates a class of wrong possibilities from consideration: every sentence that doesn't have outside in proximity to raining downranks the generation "my shoes are wet". The model is further improved by adding more inductive relationships of this sort. For example, the presence of an expanded umbrella downranks ShoesWet, the presence of "stepped in puddle" upranks it. Construct about a billion of these kinds of inductive relationships, and you end up with something analogous to an explanatory model. The structural relationships encoded in the many attention matrices in modern LLMs in aggregate entail the explanatory relationships needed for causal modelling.
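The pairwise scoring described above can be sketched as single-head scaled dot-product attention. This is a minimal illustration under strong assumptions: the 2-d embeddings are invented, there are no learned query/key/value projections, and there is no causal mask, all of which real LLMs have.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: every query is scored against every key."""
    d = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return weights, outputs

# Hypothetical embeddings, chosen so "raining" and "outside" point the same way.
tokens = ["raining", "outside", "umbrella"]
embeddings = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
weights, outputs = self_attention(embeddings, embeddings, embeddings)

for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

With these made-up vectors, the row for "raining" puts more weight on "outside" than on "umbrella", which is the kind of pairwise relevance signal the comment refers to.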

Wow, that's a nice way to put it. I haven't seen that P(A|B -> A) notation before. Where does it come from?

But I think the OP is arguing, essentially, that P(A|B -> A) is only an interpretation of P(A|B) that we have chosen, among the many possible interpretations of P(A|B).

Which I think evokes the problem of induction. How do we know that P(A| B -> A) when all we can observe ourselves is P(A|B)?

I very disagree but have an upvote for a well-argued comment.

>> The question is, does an LLM in the course of modelling asymmetric correlations, develop something analogous to an explanatory model. I think so, in the sense that a good statistical model will intrinsically capture explanatory relations.

A statistical model may "capture" explanatory relations, but can it use them? A data scientist showing a plot to a bunch of people is explaining something using a statistical model, so obviously the statistical model has some explanatory power. But it's the data scientist that is using the model as an explanation. I think the discussion is whether a statistical model can exist that doesn't just "capture" an explanation, but can also make use of that explanation like a human would, for example as background knowledge to build new explanations. That seems very far fetched: a statistical model that doesn't just model, but also introspects and has agency.

Anyway I find it very hard to think of language models as explanatory models. They're predictive models, they are black boxes, they model language, but what do they explain? And to whom? The big debate is that (allegedly) "we don't understand language models" in the first place. We have a giant corpus of incomprehensible data; we train a giant black box model on it; now we have a giant incomprehensible model of the data. What did we explain?

>> But this is in the solution space of good models of statistical regularity of an external system. To maximally predict the next token in a sequence just requires a model of the process that generates that sequence.

Let's call that model M* for clarity. The search space of models, let's call it S. There are any number of models in S that can generate many of the same sequences as M* without being M*. The question is, and has always been, in machine learning, how do we find M* in S, without being distracted by M_1, M_2, M_3, ..., ... that are not M*.

Given that we have a very limited way to test the capabilities of models, and that models are getting bigger and bigger (in machine learning anyway) which makes it harder and harder to get a good idea of what, exactly, they are modelling, how can we say which model we got a hold of?
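The identification problem above can be shown with a tiny example: two candidate models that reproduce the same observed sequence and then diverge on the first unseen item. The sequence and both candidate functions are illustrative choices, not anything from the thread.

```python
def doubling(n):
    """Candidate model M_1: 1, 2, 4, 8, ..."""
    return 2 ** n

def lazy_caterer(n):
    """Candidate model M_2: 1, 2, 4, 7, ... (maximum pieces from n straight cuts)."""
    return n * (n + 1) // 2 + 1

observed = [1, 2, 4]  # the only data available to the learner
candidates = {"doubling": doubling, "lazy_caterer": lazy_caterer}

# Both candidates fit every observed item exactly.
consistent = [name for name, model in candidates.items()
              if [model(n) for n in range(len(observed))] == observed]

# They disagree on the next, unseen item.
next_predictions = {name: candidates[name](len(observed)) for name in consistent}
print(consistent, next_predictions)
```

Nothing in the observed data favours one candidate over the other; only further data, or a prior over models, separates them.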

>A statistical model may "capture" explanatory relations, but can it use them?

That's the beauty of autoregressive training, the model is rewarded for capturing and utilizing explanatory relations because they have an outsized effect on prediction. It's the difference between frequency counting while taking the past context as an opaque unit vs decomposing the past context and leveraging relevant tokens for generation while ignoring irrelevant ones. Self-attention does this by searching over all pairs of tokens in the context window for relevant associations. Induction heads[1] are a fully worked out example of this and help explain in-context learning in LLMs.
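As a rough illustration of the behaviour attributed to induction heads, the rule "find the previous occurrence of the current token and copy whatever followed it" can be written directly. This is a hand-coded caricature of the input/output pattern, not the attention circuit itself, and the example sentence is made up.

```python
def induction_predict(tokens):
    """Guess the next token by copying what followed the most recent
    earlier occurrence of the last token ([A][B] ... [A] -> [B])."""
    last = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence, so the rule has nothing to copy

context = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
print(induction_predict(context))
```

The rule depends only on the structure of the context, not on which tokens appear, which is why it is offered as one mechanism behind in-context learning.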

>Anyway I find it very hard to think of language models as explanatory models. They're predictive models, they are black boxes, they model language, but what do they explain? And to whom?

The model encodes explanatory relationships of phenomena in the world and it uses these relationships to successfully generalize its generation out-of-distribution. Basically, these models genuinely understand some things about the world. LLMs exhibit linguistic competence as they engage with subject matter to accurately respond to unseen variations in prompts of that subject matter. At least in some cases. I argue this point in some detail here[2].

>how can we say which model we got a hold of?

More sophisticated tests, ideally that can isolate exactly what was in the training data in comparison to what was generated. I think the example of the wide variety of poetry these models generate should strongly raise one's credence that they capture a sufficiently accurate model of poetry. I go into detail on this example in the link I mentioned. Aside from that, various ways of testing in-context learning can do a lot of work here[3].

[1] https://transformer-circuits.pub/2022/in-context-learning-an...

[2] https://www.reddit.com/r/naturalism/comments/1236vzf/

[3] https://twitter.com/leopoldasch/status/1638848881558704129

> Can a machine with access only to statistical patterns in the distribution of text tokens infer the physical structure of reality? We can say, as certain as anything: No.

Um. How do you square that claim with the well-known Othello paper?

https://thegradient.pub/othello/

The board state can be phrased as moves. This paper profoundly misunderstands the problem.

The issue isn't that associative statistical models of domain Z aren't also models of domain Y where Y = f(Z) -- this is obvious.

Rather there are two problems, (1) the modal properties of these models aren't right; and (2) they don't work where the target domain isn't just a rephrasing of the training domain.

>Concepts do not pre-exist concepts.

I think this is a very bold claim to make.

Each new idea/technology/concept stands on the back of all that came before it. You couldn't just pull an LLM or a dishwasher out of a hat 1000 years ago.

Right, but techniques like chain of thought reasoning can build concepts on concepts. Even if "the thing that generated the text" isn't creating new concepts, the text itself can be, because the AI has learned general patterns like reasoning and building upon previous conclusions.
> only because they contributed some magical ingredient from their souls

The fact that you turned a limitation of a specific algorithm into a call for magical powers shows quite a bit of bias on your part.

The bias is on the people asserting people have innate capabilities that are not a derivative of pattern recognition.
The argument is that humans interact with the world across many different modalities and do their statistical learning through this complex of interactions, while LLMs do their statistical learning just by what has been written (by humans) in certain internet sites.

I think it is a quite bold and philosophically poor statement to equate the "human training set" of complex interactions with the environment with what is written on the internet.

You’re arguing that the training set is different. You haven’t identified any different capabilities. What are the capabilities that make humans different?
The training sets are different in nature, not in the sense that 2 different LLMs' training sets are different. And that does not even touch that humans do not just learn from "training sets" but from interacting with the world. More like RL but not like ChatGPT's fine tuning; humans take _actions_ and they _experience_ their results in their totality, not just a "good/bad answer" feedback.

I am not saying that we cannot produce an AI with capabilities of that sort. But LLMs offer nothing at all in that direction. They can be useful in certain practical stuff, they are overhyped as hell, but they are not a step towards AGI.

You know, not all AI algorithms in use are derivative of statistical curve fitting.

But if you have some more general definition for "pattern recognition" than this, you should be perfectly able to notice that it's more general than what LLMs do.

Give an example of pattern recognition more general than what LLMs do.
> only because they contributed some magical ingredient from their souls, which didn't exist anywhere in the world's collected pre-Rust printed materials

You're focusing on the example too much. Here are more examples illustrating the question. It's doubtful that LLMs could infer solutions that lie outside their statistical models trained on existing data.

"In 1597 John Thorpe is the first recorded architect to replace multiple connected rooms with rooms along a corridor each accessed by a separate door" [1]

"Despite various publications of results where hand-washing reduced mortality to below 1%, Semmelweis's observations conflicted with the established scientific and medical opinions of the time and his ideas were rejected by the medical community. He could offer no theoretical explanation for his findings" [2]

"Button-like objects of stone, glass, bone, ceramic, and gold have been found at archaeological sites dating as early as 2000 b.c.e... One of the earliest extant pieces of clothing to show the use of buttons as fastenings is the pourpoint of Charles of Blois (c. 1319–1364)." [3]

And so on.

[1] https://en.wikipedia.org/wiki/Hallway?wprov=sfti1

[2] https://en.wikipedia.org/wiki/Ignaz_Semmelweis?wprov=sfti1

[3] https://www.encyclopedia.com/sports-and-everyday-life/fashio...

What you're saying makes sense, and I think I appreciate the point behind the examples you provided.

I think it would help your argument if you could point to such an example from the last couple of years, after the cutoff point of the LLM training data. Maybe though, nothing has been invented since then that is sufficiently unique. If there is something like that, I suppose it would be possible to try and prompt the LLM to create it. That would make your argument falsifiable and I'd be really curious to know the outcome.

The sum total of human knowledge has increased exponentially, so it's harder to come up with an example for that :)
Sorry, I don't follow at all. We can assume that every concept humans have conceived of was at some point conceived of for the first time, there's no need to list examples. But how does that relate to the claim here, that purports to constrain what outputs AIs are and aren't capable of generating?
LLMs currently statistically regurgitate existing data. An LLM in the 1600s would tell you that a house layout is "rooms connected to each other" because that would be its pre-existing data. It remains to be seen if LLMs can come up with "oh wait? we can create a passageway, and have rooms open into that" based on statistical models of pre-existing data.

Can it come up with a corridor when it has no idea that such a concept exists? That remains to be seen.

> LLMs currently statistically regurgitate existing data.

NO! They do not.

Deep learning models are "universal approximators". Any two-layer neural network with enough parameters, data and training is a universal approximator. That means they can learn ANY relationship to arbitrary accuracy.

Going beyond two layers, with several layers, problem domain structured architectures, and recurrent connections, they become far more efficient and effective.
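A small sketch of what the universal approximation claim looks like for a network with one hidden ReLU layer. The weights here are constructed by hand rather than trained, and the target function `x * x` on [0, 1] is an arbitrary choice. The theorem itself only says that suitable weights exist, not that training will find them.

```python
def relu(z):
    return z if z > 0 else 0.0

def build_network(f, n_hidden):
    """Hand-pick weights for a one-hidden-layer ReLU net that linearly
    interpolates f at n_hidden + 1 evenly spaced points on [0, 1]."""
    h = 1.0 / n_hidden
    knots = [k * h for k in range(n_hidden)]            # hidden-unit thresholds
    slopes = [(f((k + 1) * h) - f(k * h)) / h for k in range(n_hidden)]
    # Output weights: each unit adds the change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_hidden)]
    return knots, coeffs, f(0.0)

def predict(net, x):
    knots, coeffs, bias = net
    return bias + sum(c * relu(x - t) for c, t in zip(coeffs, knots))

def max_error(f, net, samples=1001):
    xs = [i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - predict(net, x)) for x in xs)

def square(x):
    return x * x

err_small = max_error(square, build_network(square, 4))
err_large = max_error(square, build_network(square, 16))
print(err_small, err_large)
```

Quadrupling the number of hidden units shrinks the worst-case error by roughly a factor of sixteen in this example, which is the sense in which accuracy is "arbitrary" given enough units.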

So yes, they learn associations, correlations, stochastic models, statistics.

But they also learn to model functional relationships. Which is why they are able to generalize relationship to new situations, and combine previously unrelated relationships in reasonable and surprising ways.

A large part of creativity is putting together previously unrelated concepts and then letting the obvious logic of those relationships combine to result in something new and unexpected.

Note that both combining normally unrelated things, and combining the concepts in some way more or less consistent with what those concepts normally mean, is well within the grasp of current models.

They haven't outclassed our best thinkers. Or any of our best thinking as individuals yet. They are still very limited on problems that require many steps to think through.

But they are definitely, within their limits, being creative.

And they are far, far, FAR from just being statistical parrots.

> and combine previously unrelated relationships in reasonable and surprising ways.

We've yet to see those surprising ways despite all the claims.

Note: what they do already is amazing and surprising in itself (such as "write me a song about quantum physics suitable for a 5 year old"). It's still very much shy of "hey there's this new thing previously unthought of".

> We've yet to see those surprising ways despite all the claims.

This is the one reason everyone is finding them fascinating.

Perhaps you find them boring. Rote. Or something. But the reason non-technical people, as well as technical people, are enjoying and learning by interacting with chat and other models is how often the results are interesting.

I asked ChatGPT-4 to create a Dr. Seuss story about Cat in the Hat and my green conure parrot Teansy, that involved sewing and Italy. It produced a wonderful story of how they met in Italy, became friends, encountered a homeless child with a threadbare blanket and helped the child. Then began helping others and ended up creating a fashion design studio.

All written in Dr. Seuss prose that made for a perfect children's book.

Pretty creative.

I then asked GPT to continue the story, but as a James Bond novel where one of Teansy's mysterious clients was actually a criminal using the fashion industry to hide his nefarious practices, and that Teansy should help James Bond solve the case.

For that I got another great story, completely consistent with James Bond tropes. It came up with a story line where the fashion industry was used to launder blood diamonds, which I thought was brilliant. A perfectly good rationale for a James Bond villain. The story was great.

Throughout, Chat threw in funny suitable mentions about Teansy's fashion focuses, including feather lined wear, etc.

And all this creativity in a first draft written as fast as I could read it.

A year ago, nothing on the planet but a whimsical human with too much time on their hands (more time than it took Chat) could do this.

--

Obviously, we are discovering Chat can perform far more complex behaviors.

Act as any agent we describe including computer systems, or the internet. Respond quickly to feedback. Form plans. Learn and summarize the grammar of small artificial languages fairly well just from examples, ...

Without interacting with these models we would never have declared these were expected behaviors.

So I don't know on what basis the emergence of these behaviors isn't surprising. Hoped for, envisioned, sure. But hardly an expression of obviously predetermined designed-in capabilities.

> Can it come up with...when it has no idea that such a concept exists?

I was curious to see if it could.

=== Q. Invent a new statistic that mixes hessian and likelihood with score.

gpt: One potential new statistical concept that mixes Hessian and likelihood with the score function could be called "Hessian-score likelihood ratio".

The idea behind this concept is to use the Hessian matrix, which captures second-order information about the likelihood function, along with the score function, which measures the sensitivity of the likelihood function to changes in the model parameters, to construct a likelihood ratio test statistic.

Q. How do I compute this new statistic ?

gpt. To calculate this statistic, we first compute the Hessian matrix of the log-likelihood function evaluated at the maximum likelihood estimate of the parameters. We then calculate the score function evaluated at the maximum likelihood estimate of the parameters.

We can then use these quantities to construct a test statistic that compares the likelihood of the model with the estimated parameters to the likelihood of the model with an alternative set of parameters. Specifically, we take the difference between the log-likelihoods of the two models, and divide by the square root of the determinant of the Hessian matrix, multiplied by the score function.

Q. Of what use is this new statistic ?

gpt. The resulting test statistic can be used to test hypotheses about the parameters of the model, such as whether they are equal to a particular value or whether they differ across groups or conditions. The Hessian-score likelihood ratio offers a potentially more powerful and efficient way to test such hypotheses than existing methods, since it makes use of both second-order information about the likelihood function and the sensitivity of the likelihood to changes in the parameters. ====

I am not able to spot any obvious inaccuracies in the above.

>> Can it come up with...when it has no idea that such a concept exists?

> I was curious to see if it could.

But you told it about the concept, "a new statistic that mixes hessian and likelihood with score".

You should try a different experiment. I'm more familiar with architecture than statistics so I'll use the floor plan example. Suppose someone in the 16th century had asked their LLM to address the pain points of the joined-room approach, and the LLM had then conceived of the novel concept of a corridor. [Look up the origin of the word.]

If that 16th century LLM spat out "the overall concept is to distinguish between transient and in repose spaces. There is already something similar in military architecture called 'corridor', which is a strip of land along the outer edge of a ditch. In these new floor plan designs, there will be corridors internal and peripheral to the building that will connect rooms, just like corridors (strips of land) connect lots or permit movement without stepping into a ditch".

-- can this happen? --

You: Invent a novel test statistic that can be used to test hypotheses about the parameters of the model, such as whether they are equal to a particular value or whether they differ across groups or conditions. You can combine existing statistical tools.

gpt: How about "Hessian-score likelihood ratio"? The idea behind this concept is to use the Hessian matrix, which captures second-order information about the likelihood function, along with the score function, which measures the sensitivity of the likelihood function to changes in the model parameters, to construct a likelihood ratio test statistic.

This is interpolation. And more than that, your prompt is the source of the actual novelty, little as it is.
> LLMs currently statistically regurgitate existing data.

This is clearly not true in any meaningful sense - cf. the Othello paper, examples from the top of this very comment thread, etc.

> Can it come up with a corridor when it has no idea that such a concept exists?

Unless I'm missing something, the person I replied to is claiming that it categorically cannot come up with a concept it hasn't been trained on. I'm disagreeing - if a model knows about rooms and doors and floorplans, there's no obvious reason why it mightn't think up an arrangement of those things that would be novel to the people who trained it. If you think the matter remains to be seen, then I'm not sure what you disagree with me about.

In my experience, it can certainly be coaxed into discussing novel concepts that transcend existing knowledge. I'm having fun getting it to explain what a hybrid of a Nelson Enfilade data structure combined with a tensegrity data structure is and if that system is novel and brings any benefits, very interesting and novel afaik.
It seems like every time someone says that it's doing something novel, they present an example of interpolation between existing concepts.

This is useful, but the source of novelty here is the prompt; the rest is the work of interpolation.

This is all very reminiscent of image generation. There too, novelty is limited to interpolation.

> if a model knows about rooms and doors and floorplans, there's no obvious reason why it mightn't think up an arrangement of those things that would be novel to the people who trained it.

Once again, you're missing the point.

In the 16th century people also knew about floors, and rooms, and floorplans. And yet, the first architect to use a corridor used it for the first time in 1597.

What other "corridors" are missing from LLMs' training data? And we're sure it can come up with such a missing concept?

The Othello paper and the examples (are you referring to the example of coming up with new words?) are doing the same thing: they feed the model well-defined pre-established rules that can be statistically combined. The "novel ideas" are not even nearly novel because, well, they follow the established rules.

Could the model invent reversi/othello had it not known about it beforehand? Could the model invent new words (or a new language) had it not known about how to do that beforehand (there's plenty of research on both)? Can it satisfactorily do either even now (for some definition of satisfactorily)?

People believe it can only because the training set is quite vast and the work done is beyond any shadow of the doubt brilliant. That is why the invention of new words seems amazing and novel to many people while others even with a superficial armchair knowledge of linguistics are nonplussed. And so on.

> Could the model invent reversi/othello had it not known about it beforehand?

You've practically restated the paper's findings! :D The LLM knew nothing about othello; it wasn't shown any rules to be recombined. It was shown only sequences of 60 distinct tokens - effectively sentences in an unknown language. The LLM then inferred a model to predict the grammar of that language, and the authors demonstrated that its model functioned like an othello board.
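For reference, the mapping at issue, from a bare sequence of move tokens to a board state, can be written down by hand as a rule-based replay. This sketch is not what the network does internally and is not taken from the paper. The `d3`-style token format is an assumption, and pass handling is simplified: a move that is illegal for the side to move is treated as the opponent's.

```python
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def initial_board():
    """Standard Othello start: 8x8 grid with four discs in the centre."""
    board = [["." for _ in range(8)] for _ in range(8)]
    board[3][3] = board[4][4] = "W"   # d4, e5
    board[3][4] = board[4][3] = "B"   # e4, d5
    return board

def flips(board, row, col, player):
    """Opponent discs that `player` would flip by moving at (row, col)."""
    if board[row][col] != ".":
        return []
    opponent = "W" if player == "B" else "B"
    flipped = []
    for dr, dc in DIRECTIONS:
        run, r, c = [], row + dr, col + dc
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            run.append((r, c))
            r, c = r + dr, c + dc
        # A run counts only if it is closed off by one of the player's own discs.
        if run and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            flipped.extend(run)
    return flipped

def board_from_moves(moves):
    """Replay tokens such as 'd3' (column letter, row digit); Black moves first."""
    board, player = initial_board(), "B"
    for move in moves:
        col, row = ord(move[0]) - ord("a"), int(move[1]) - 1
        captured = flips(board, row, col, player)
        if not captured:
            # Simplification: assume the side to move had to pass.
            player = "W" if player == "B" else "B"
            captured = flips(board, row, col, player)
        if not captured:
            raise ValueError(f"illegal move: {move}")
        for r, c in [(row, col)] + captured:
            board[r][c] = player
        player = "W" if player == "B" else "B"
    return board

def count(board, disc):
    return sum(row.count(disc) for row in board)

board = board_from_moves(["d3", "c3"])
print("\n".join("".join(row) for row in board))
```

The point of contention in the thread is whether a model shown only the token sequences ends up encoding something that behaves like `board_from_moves`, without ever being given these rules.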