Hacker News
by mjburgess 1156 days ago
Concepts are developed by animals over time. A baby develops sensory-motor concepts from day one; a child abstracts them; a teenager communicates them; and an adult refines that communication.

They are not developed as a matter of averaging over all the text on the internet.

Concepts do not pre-exist concepts.

Respectfully, that sounds like hand-waving. Claiming to know where concepts do and don't come from just leads to questions like "did the natural numbers exist before we did?", which are centuries old and presumably not resolvable.

Whereas a more focused question like "can an AI produce outputs that are novel to someone familiar with all of the AI's inputs?" seems resolvable, and even if one thinks it's unlikely or not easy, it's very hard to buy the idea that it's impossible.

> just leads to questions

No, not really. People in this area are very poorly informed about animal learning and "ordinary science".

AI evangelists like to treat as "merely philosophical matters" profoundly scientific ones.

The issues here belong to ordinary science. Can a machine with access only to statistical patterns in the distribution of text tokens infer the physical structure of reality?

We can say, as certain as anything: No.

Associative statistical models are not phenomenological models (ie., specialised to observable cause-effect measures); and phenomenological models are not causal (ie., do not give the mechanism of the cause-effect relationship).

Further, we know as surely as an athlete catching a ball, that animals develop causal models of their environments "deeply and spontaneously".

And we know, to quite a robust degree, how they do so: by using interior causal models of their bodies to change their environments through intentional acts, they can confirm or disconfirm their environmental models. This is modelled logically as abduction, causally as sensory-motor adaptation, and so on.

This is not a philosophical matter. We know that "statistical learning" which is nothing more than a "correlation maximisation objective" over non-phenomenological, non-causal, non-physical data produces approximate associative models of those target domains -- that have little use beyond "replaying those associations".

ChatGPT appears to do many things. But you will see soon, after a year or two of papers published, that those things were tricks. That "replaying associations in everything ever written" is a great trick, that is very useful to people.

Today you can ask ChatGPT to rewrite Harry Potter "if Harry were evil" or some such thing. That's because there are many libraries of books on Harry Potter and "evil" -- and by statistical interpolation alone, you can answer an apparent counter-factual question which should require imagination.

But give ChatGPT an actual counter-factual whose parts are only in the question, and you'll be out-of-luck.

Eg., tell it about tables, chairs, pens, cups and ask it to arrange them using given operations so that, eg., the room is orderly. Or whatever you wish.

Specified precisely enough you can expose the trick.

>This is not a philosophical matter. We know that "statistical learning" which is nothing more than a "correlation maximisation objective" over non-phenomenological, non-causal, non-physical data produces approximate associative models of those target domains -- that have little use beyond "replaying those associations".

Why do you think the data LLMs are trained on are non-causal? Let's take causation as asymmetric correlation. That is, the presence of (A,B) in the training data does not imply the presence of (B,A). But of course human text is asymmetric in this manner and LLMs will pick up on this asymmetry. You might say that causation isn't merely about asymmetric correlation, but about the former determining the latter. But this isn't something we observe from nature; it is an explanatory posit that humans have landed on in service of modelling the world. So causation is intrinsically explanatory, and explanation is intrinsically causal. The question is, does an LLM in the course of modelling asymmetric correlations, develop something analogous to an explanatory model. I think so, in the sense that a good statistical model will intrinsically capture explanatory relations.

Cashing out explanation and explanatory model isn't easy. But as a first pass I can say that explanatory models capture intrinsic regularity of a target system such that the model has an analogical relationship with internal mechanisms in the target system. This means that certain transformations applied to the target system have a corresponding transformation in the model that identifies the same outcome. If we view phenomena in terms of mechanistic levels with the extrinsic observable properties as the top level and the internal mechanisms as lower levels, an explanatory model will model some lower mechanistic level and recover properties of the top level.

But this is in the solution space of good models of statistical regularity of an external system. To maximally predict the next token in a sequence just requires a model of the process that generates that sequence.

Well this is a good reply, but it's mistaken.

The conditional probability:

    P(x[0]| x[-1], x[-2], x[-3] ...)
is not the same as,

    P(x[0] | x[-1], x[-2], ... -> x[0])
Where `->` says we select only those cases where x[-1],... brought-about x[0].

To see why this is the case, suppose we do have a god's-eye view of all of spacetime.

    P(A|B)
always selects for all instances where A follows B.

    P(A| B -> A)
selects only those instances where A's following B was caused by B.
Eg.,

    P(ShoesWet | Raining)
is very different from

    P(ShoesWet | Raining -> ShoesWet)
In the former case the two events have, in general, nothing to do with each other.

To select "Raining -> ShoesWet" even with a god's-eye view we need more than statistics... since those events which count as "Raining -> ShoesWet" have to be selected on a non-statistical basis.

For the athlete catching a ball, or the scientist designing the experiment, we're interested only in those causal cases.

For sure P(A|B) is an (approximate, statistical) model of P(A| B->A) -- but it's a very restricted, limited model.

The athlete needs to estimate P(ball-stops | catch -> ball-stops)

NOT P(ball-stops | catch) which is just any case of the ball-stopping given any case of catching.
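
A toy sketch of the selection problem described above. It assumes a hypothetical god's-eye-view log in which every record carries a hidden `cause` label; that label, the two "worlds", and all function names are made up for illustration. The two worlds have identical observable columns, so they give the same P(ShoesWet | Raining), yet they differ in which records count as Raining -> ShoesWet.

```python
from fractions import Fraction

# Each record: (raining, shoes_wet, cause). `cause` is a hypothetical
# god's-eye-view label for what actually made the shoes wet; it is NOT
# part of the observable data.
world_1 = [
    (True, True, "rain"),
    (True, True, "rain"),
    (True, False, None),
    (False, True, "puddle"),
    (False, False, None),
]
world_2 = [
    (True, True, "sprinkler"),  # raining, but the rain was not the cause
    (True, True, "rain"),
    (True, False, None),
    (False, True, "puddle"),
    (False, False, None),
]

def p_wet_given_rain(records):
    """Purely statistical estimate of P(ShoesWet | Raining)."""
    rainy = [r for r in records if r[0]]
    return Fraction(sum(1 for r in rainy if r[1]), len(rainy))

def caused_by_rain(records):
    """Select the 'Raining -> ShoesWet' cases; this needs the hidden label."""
    return [r for r in records if r[0] and r[1] and r[2] == "rain"]

def observed(records):
    """Drop the hidden label, leaving only what statistics can see."""
    return [(raining, wet) for raining, wet, _ in records]

print(observed(world_1) == observed(world_2))                # same observable data
print(p_wet_given_rain(world_1), p_wet_given_rain(world_2))  # same conditional
print(len(caused_by_rain(world_1)), len(caused_by_rain(world_2)))  # different causal cases
```

Nothing in `observed(...)` distinguishes the two worlds; only the extra label does.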

Let me alter your example a bit: we have P(A|B), we want P(A|B,B->A). But given enough examples of the form P(A|B), a good algorithm can deduce B->A and use it going forward to predict A. How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases. LLMs do this with self-attention, by taking every pair of symbols in the context window and testing whether each pair is useful in determining the next token. As the attention matrix converges, the model can leverage the presence of "Raining & Outside" in predicting "ShoesWet".

Of course, this is a rather poor excuse for an explanation. The fact that "outside" and "raining" are close doesn't explain why "my shoes are wet". But it does get us closer to a genuine explanation in the sense that it eliminates a class of wrong possibilities from consideration: every sentence that doesn't have outside in proximity to raining downranks the generation "my shoes are wet". The model is further improved by adding more inductive relationships of this sort. For example, the presence of an expanded umbrella downranks ShoesWet, the presence of "stepped in puddle" upranks it. Construct about a billion of these kinds of inductive relationships, and you end up with something analogous to an explanatory model. The structural relationships encoded in the many attention matrices in modern LLMs in aggregate entail the explanatory relationships needed for causal modelling.
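
For readers who haven't seen it written out, here is a minimal sketch of the pairwise scoring that self-attention does, assuming a single head with a causal mask. The 2-d query and key vectors are hand-picked and hypothetical; in a real transformer they are learned projections of token embeddings, and the function names here are my own.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def causal_attention_weights(queries, keys):
    """Row i holds the weights position i puts on positions 0..i."""
    d = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        # Score every earlier-or-equal position; later positions are masked out.
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        row = softmax(scores) + [0.0] * (len(keys) - i - 1)
        weights.append(row)
    return weights

# Hypothetical vectors for the tokens "raining", "outside", "shoes".
tokens = ["raining", "outside", "shoes"]
keys = [[2.0, 0.0], [0.0, 2.0], [0.1, 0.1]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.5, 1.5]]  # "shoes" matches both earlier keys

weights = causal_attention_weights(queries, keys)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

With these made-up vectors the last position puts most of its weight, split equally, on "raining" and "outside". Whether weights like these amount to anything explanatory is exactly what is in dispute here; the sketch only shows the mechanics.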

> How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases.

But the machine doesn't know which are the right cases. We aren't presuming there's a column, Z = 1 for B -> A, and Z = 0 otherwise -- right?

The machine has no mechanism to distinguish these cases.

> testing whether each pair is useful in determining the next token

This isn't causation.

> every sentence that doesn't have outside in proximity to raining downranks the generation

So long as the sequential structure of sentences corresponds to the causal structure of the world: but that's kinda insane right?

We haven't rigged human language so that the distribution of tokens is the causal structure of the world. The reason text generated by LLMs appears meaningful is because we understand it. The actual structure of text generated isn't "via" a model of the world.

(Consider, for example, training an LLM on a dead untranslated language -- its output is incomprehensible, and its weights are arbitrarily correlated with anything we care to choose.)

Nevertheless, given our choice of token, you do have a model which says:

    P(ShoesWet|~Rain) < P(ShoesWet | Rain) < P(ShoesWet|Rain & Outside)
That's true. But we're choosing these additional conjunctions because we already know the causal model; these conjunctions are how we're eliminating confounders to get an approximation close to the actual.
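
A small sketch of how that ordering falls out of plain counting. The co-occurrence counts are invented for illustration, and so is the `p_wet` helper. Note that which columns get tabulated (Rain, Outside) is a choice made by whoever builds the table.

```python
from fractions import Fraction

# Hypothetical co-occurrence counts: (rain, outside, shoes_wet) -> count.
counts = {
    (True, True, True): 40,
    (True, True, False): 10,
    (True, False, True): 5,
    (True, False, False): 45,
    (False, True, True): 5,
    (False, True, False): 45,
    (False, False, True): 2,
    (False, False, False): 48,
}

def p_wet(condition):
    """P(ShoesWet | condition), where condition filters on (rain, outside)."""
    matching = {k: n for k, n in counts.items() if condition(k[0], k[1])}
    total = sum(matching.values())
    wet = sum(n for k, n in matching.items() if k[2])
    return Fraction(wet, total)

p_no_rain = p_wet(lambda rain, outside: not rain)
p_rain = p_wet(lambda rain, outside: rain)
p_rain_outside = p_wet(lambda rain, outside: rain and outside)

print(p_no_rain, p_rain, p_rain_outside)
print(p_no_rain < p_rain < p_rain_outside)
```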

(Which you'll never get: the actual value is `1`. If B -> A, then P(A|B->A) = 1 -- this is a deductive inference necessary for ordinary science to take place.)

In any case, P(A | B -> A) means without any confounders. To actually find the LLM's approximation of this we'd need to compute:

    P(A|B & C1 & C2 & C3 ...)  forall C_i..inf
And then find P(A|B & C') st. C' made P(A|B) maximally likely.

If you find a set of {C} st. P(A|B) has a high probability, you won't find causal conditions.

All that statistical association models here is, at best, salience -- not causal relevance.
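
To make the salience point concrete, here is a toy sketch, assuming invented records and an invented extra variable `footprints` (wet footprints, which would be an effect of wet shoes, not a cause). Searching for the condition C that maximises P(ShoesWet | Rain & C) picks the effect, because the search only sees association.

```python
from fractions import Fraction

# Hypothetical records. "footprints" is an EFFECT of wet shoes, "outside" is
# part of the causal story; the data itself carries no such labels.
FIELDS = ("rain", "outside", "footprints", "shoes_wet")
records = (
    [(1, 1, 1, 1)] * 6
    + [(1, 1, 0, 1)] * 2
    + [(1, 1, 0, 0)] * 2
    + [(1, 0, 0, 0)] * 6
    + [(1, 0, 1, 1)] * 1
    + [(0, 1, 0, 0)] * 5
    + [(0, 0, 0, 0)] * 5
)

def p_wet_given(rows, **conditions):
    """P(shoes_wet = 1 | conditions), estimated by counting."""
    idx = {name: i for i, name in enumerate(FIELDS)}
    kept = [r for r in rows if all(r[idx[k]] == v for k, v in conditions.items())]
    return Fraction(sum(r[idx["shoes_wet"]] for r in kept), len(kept))

# Search over candidate conditions C for the one maximising P(wet | rain & C).
candidates = ["outside", "footprints"]
scores = {c: p_wet_given(records, rain=1, **{c: 1}) for c in candidates}
best = max(scores, key=scores.get)

print(p_wet_given(records, rain=1))  # baseline P(wet | rain)
print(scores)
print(best)
```

The winning condition is the most predictive one, not the causally relevant one.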

Wow, that's a nice way to put it. I haven't seen that P(A|B -> A) notation before. Where does it come from?

But I think the OP is arguing, essentially, that P(A|B -> A) is only an interpretation of P(A|B) that we have chosen, among the many possible interpretations of P(A|B).

Which I think evokes the problem of induction. How do we know that P(A| B -> A) when all we can observe ourselves is P(A|B)?

> when all we can observe ourselves is P(A|B)?

No, we actually observe P(A | B -> A) where `B` is our body and `A` is some action we take on the world.

Hume was WRONG. Very wrong.

Statistical AI has the problem of induction; we have bodies, so we do not.

----

As for notation, I'm riffing off Judea Pearl's do notation.

He'd say, P(A|do(B))

but his `do` operator is slightly more general

Google: do-operator, causal analysis, Judea Pearl, etc.
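
A minimal sketch of the difference between conditioning and intervening, assuming a made-up structural model in which a hidden confounder U drives both B and A, and B has no effect on A at all. The probabilities and function names are hypothetical. Setting `do_b` forces B and cuts its dependence on U, which is the usual reading of P(A|do(B)).

```python
from fractions import Fraction
from itertools import product

# Hypothetical structural model: a confounder U drives both B and A,
# and B has NO causal effect on A.
P_U = {1: Fraction(1, 2), 0: Fraction(1, 2)}
P_B_GIVEN_U = {1: Fraction(9, 10), 0: Fraction(1, 10)}  # P(B=1 | U=u)
P_A_GIVEN_U = {1: Fraction(8, 10), 0: Fraction(2, 10)}  # P(A=1 | U=u), ignores B

def joint(do_b=None):
    """Joint P(u, b, a). With do_b set, B is forced and its link to U is cut."""
    table = {}
    for u, b, a in product((0, 1), repeat=3):
        if do_b is None:
            p_b = P_B_GIVEN_U[u] if b == 1 else 1 - P_B_GIVEN_U[u]
        else:
            p_b = Fraction(int(b == do_b))
        p_a = P_A_GIVEN_U[u] if a == 1 else 1 - P_A_GIVEN_U[u]
        table[(u, b, a)] = P_U[u] * p_b * p_a
    return table

def p_a_given_b(table, b=1):
    """P(A=1 | B=b) read off a joint table."""
    num = sum(p for (u, bb, a), p in table.items() if bb == b and a == 1)
    den = sum(p for (u, bb, a), p in table.items() if bb == b)
    return num / den

observational = p_a_given_b(joint())         # P(A=1 | B=1)
interventional = p_a_given_b(joint(do_b=1))  # P(A=1 | do(B=1))

print(observational, interventional)
```

In this toy model, seeing B raises the probability of A well above its base rate, while doing B leaves it at the base rate.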

I very much disagree but have an upvote for a well-argued comment.

>> The question is, does an LLM in the course of modelling asymmetric correlations, develop something analogous to an explanatory model. I think so, in the sense that a good statistical model will intrinsically capture explanatory relations.

A statistical model may "capture" explanatory relations, but can it use them? A data scientist showing a plot to a bunch of people is explaining something using a statistical model, so obviously the statistical model has some explanatory power. But it's the data scientist that is using the model as an explanation. I think the discussion is whether a statistical model can exist that doesn't just "capture" an explanation, but can also make use of that explanation like a human would, for example as background knowledge to build new explanations. That seems very far fetched: a statistical model that doesn't just model, but also introspects and has agency.

Anyway I find it very hard to think of language models as explanatory models. They're predictive models, they are black boxes, they model language, but what do they explain? And to whom? The big debate is that (allegedly) "we don't understand language models" in the first place. We have a giant corpus of incomprehensible data; we train a giant black box model on it; now we have a giant incomprehensible model of the data. What did we explain?

>> But this is in the solution space of good models of statistical regularity of an external system. To maximally predict the next token in a sequence just requires a model of the process that generates that sequence.

Let's call that model M* for clarity. The search space of models, let's call it S. There are any number of models in S that can generate many of the same sequences as M* without being M*. The question is, and has always been, in machine learning, how do we find M* in S, without being distracted by M_1, M_2, M_3, ..., ... that are not M*.

Given that we have a very limited way to test the capabilities of models, and that models are getting bigger and bigger (in machine learning anyway) which makes it harder and harder to get a good idea of what, exactly, they are modelling, how can we say which model we got a hold of?
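
A tiny sketch of the M* versus M_i problem, assuming a hypothetical "true" generator (powers of two) and a rival cubic chosen to agree with it on the first four observations. Both function names are made up.

```python
def m_star(n):
    """Hypothetical 'true' generator: powers of two."""
    return 2 ** n

def m_1(n):
    """A rival model: a cubic that matches m_star on n = 0..3."""
    # n**3 + 5*n + 6 is always divisible by 6, so integer division is exact.
    return (n ** 3 + 5 * n + 6) // 6

observed = range(4)

print([m_star(n) for n in observed])
print([m_1(n) for n in observed])
print(m_star(4), m_1(4))
```

On the observed range the two models are indistinguishable; they part ways at the first unseen point.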

>A statistical model may "capture" explanatory relations, but can it use them?

That's the beauty of autoregressive training, the model is rewarded for capturing and utilizing explanatory relations because they have an outsized effect on prediction. It's the difference between frequency counting while taking the past context as an opaque unit vs decomposing the past context and leveraging relevant tokens for generation while ignoring irrelevant ones. Self-attention does this by searching over all pairs of tokens in the context window for relevant associations. Induction heads[1] are a fully worked out example of this and help explain in-context learning in LLMs.
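
As a rough illustration of the behaviour attributed to induction heads (seeing [A][B] ... [A] and predicting [B]), here is a hand-written toy analogue. It is an exact-match lookup, not the learned attention circuit, and the function name is hypothetical.

```python
def induction_predict(tokens):
    """Toy analogue of an induction head: find the most recent earlier
    occurrence of the current (last) token and predict the token that
    followed it. Returns None if the current token has not appeared before."""
    current = tokens[-1]
    # Scan earlier positions from right to left, excluding the last one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_predict(["A", "B", "C", "A"]))
print(induction_predict(["x", "y"]))
```

The contrast with frequency counting is that the prediction depends on decomposing the context and picking out one relevant earlier position, not on how often the whole context was seen.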

>Anyway I find it very hard to think of language models as explanatory models. They're predictive models, they are black boxes, they model language, but what do they explain? And to whom?

The model encodes explanatory relationships of phenomena in the world and it uses these relationships to successfully generalize its generation out-of-distribution. Basically, these models genuinely understand some things about the world. LLMs exhibit linguistic competence as they engage with subject matter to accurately respond to unseen variations in prompts of that subject matter. At least in some cases. I argue this point in some detail here[2].

>how can we say which model we got a hold of?

More sophisticated tests, ideally that can isolate exactly what was in the training data in comparison to what was generated. I think the example of the wide variety of poetry these models generate should strongly raise one's credence that they capture a sufficiently accurate model of poetry. I go into detail on this example in the link I mentioned. Aside from that, various ways of testing in-context learning can do a lot of work here[3].

[1] https://transformer-circuits.pub/2022/in-context-learning-an...

[2] https://www.reddit.com/r/naturalism/comments/1236vzf/

[3] https://twitter.com/leopoldasch/status/1638848881558704129

>> That's the beauty of autoregressive training, the model is rewarded for capturing and utilizing explanatory relations because they have an outsized effect on prediction.

That sentence should be decorated with the word "allegedly", or perhaps "conjecture"! In practical terms, I believe you are pointing out that language models of the GPT family are trained on a context surrounding, not just preceding, a predicted token. That's right (and it gets fudged in discussions about predicting the next token in a sequence), but we could already do that with skip-gram models, and with context-sensitive grammars, and dependency grammars, many years ago, and I don't remember anyone saying those were specially capable of capturing explanatory relations [1]. Although for grammars the claim could be made, since they are generally based on explanatory models of human language (but not because of context-sensitivity).

Anyway, I thought you were arguing that explanations are arbitrary, "explanatory posits", and wouldn't that mean that an explanation doesn't improve prediction? This is not to catch you in contradiction, I'm genuinely unsure about this myself. My understanding is that explanatory hypotheses improve predictions in the long run [2], but that's not to say that a predictive model will improve given explanations, rather explanatory models eventually replace strictly predictive models.

Are you saying that including explanations in training data can improve prediction? That would make sense, but this is very hard to do when training a predictive model on text. In that case, the explanations are at best hidden variables and language models are just not the right kind of model to model hidden variables.

Sorry, writing too much today. And I got work to do. So I won't bitch about "in-context learning" (what we used to call sampling from a model back in the day, three years ago before the GPT-3 paper :).

______________

[1] My Master's thesis was a bunch of language models trained on Howard Phillips Lovecraft's complete works, and separately on a corpus of Magic: the Gathering cards. One of those models was a probabilistic Context-Free Grammar, and despite its context-freedom, and because it was a Definite Clause Grammar, I could sample from it with input strings like "The X in the darkness with the Y in the Z of the S" and it would dutifully fill-in the blanks with tokens that maximised the probability of the sentence. So even my puny PCFG could represent bi-directional context, after a fashion. Yet I wouldn't ever accuse it of being explanatory. Although I would say it was quite mad, given the corpus.

[2] I mention in another comment my favourite example of the theory of epicycles compared to Kepler's laws of planetary motion.

> Can a machine with access only to statistical patterns in the distribution of text tokens infer the physical structure of reality? We can say, as certain as anything: No.

Um. How do you square that claim with the well-known Othello paper?

https://thegradient.pub/othello/

The board state can be phrased as moves. This paper profoundly misunderstands the problem.

The issue isn't that associative statistical models of domain Z aren't also models of domain Y where Y = f(Z) -- this is obvious.

Rather there are two problems, (1) the modal properties of these models aren't right; and (2) they don't work where the target domain isn't just a rephrasing of the training domain.

>Concepts do not pre-exist concepts.

I think this is a very bold claim to make.

Each new idea/technology/concept stands on the back of all that came before it. You couldn't just pull an LLM or a dishwasher out of a hat 1000 years ago.

Right, but techniques like chain of thought reasoning can build concepts on concepts. Even if "the thing that generated the text" isn't creating new concepts, the text itself can be, because the AI has learned general patterns like reasoning and building upon previous conclusions.