by hackinthebochs 1156 days ago
>This is not a philosophical matter. We know that "statistical learning" which is nothing more than a "correlation maximisation objective" over non-phenomenological, non-causal, non-physical data produces approximate associative models of those target domains -- that have little use beyond "replaying those associations".

Why do you think the data LLMs are trained on are non-causal? Let's take causation as asymmetric correlation. That is, (A,B) being present in the training data does not imply that (B,A) is present. But of course human text is asymmetric in this manner and LLMs will pick up on this asymmetry. You might say that causation isn't merely about asymmetric correlation, but about the former determining the latter. But this isn't something we observe from nature, it is an explanatory posit that humans have landed on in service of modelling the world. So causation is intrinsically explanatory, and explanation is intrinsically causal. The question is, does an LLM in the course of modelling asymmetric correlations, develop something analogous to an explanatory model. I think so, in the sense that a good statistical model will intrinsically capture explanatory relations.

Cashing out explanation and explanatory model isn't easy. But as a first pass I can say that explanatory models capture intrinsic regularity of a target system such that the model has an analogical relationship with internal mechanisms in the target system. This means that certain transformations applied to the target system have corresponding transformations in the model that identify the same outcome. If we view phenomena in terms of mechanistic levels with the extrinsic observable properties as the top level and the internal mechanisms as lower levels, an explanatory model will model some lower mechanistic level and recover properties of the top level.

But this is in the solution space of good models of statistical regularity of an external system. To maximally predict the next token in a sequence just requires a model of the process that generates that sequence.


Well this is a good reply, but it's mistaken.

The conditional probability:

    P(x[0]| x[-1], x[-2], x[-3] ...)
is not the same as,

    P(x[0] | x[-1], x[-2], ... -> x[0])
Where `->` says we select only those cases where x[-1],... brought-about x[0].

To see why this is the case, suppose we do have a god's eye-view of all of spacetime.

    P(A|B)
always selects for all instances where A follows B.

    P(A | B -> A)
selects only those instances where A's following B was caused by B.
Eg.,

    P(ShoesWet | Raining)
is very different from

    P(ShoesWet | Raining -> ShoesWet)
In the former case the two events have, in general, nothing to do with each other.

To select "Raining -> ShoesWet" even with a god's-eye view we need more than statistics... since those events which count as "Raining -> ShoesWet" have to be selected on a non-statistical basis.

For the athlete catching a ball, or the scientist designing the experiment, we're interested only in those causal cases.

For sure P(A|B) is an (approximate, statistical) model of P(A| B->A) -- but it's a very restricted, limited model.

The athlete needs to estimate P(ball-stops | catch -> ball-stops)

NOT P(ball-stops | catch) which is just any case of the ball-stopping given any case of catching.
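To make the distinction concrete, here is a minimal sketch in Python. The toy world, its variables (`rain`, `outside`, `puddle`) and all of its probabilities are assumptions invented for illustration, not anything measured. The thing to watch is the `rain_caused` flag: it is exactly the causal label that a purely observational dataset would not contain.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy world; every probability here is an illustrative assumption.
P_RAIN = Fraction(3, 10)     # it is raining
P_OUTSIDE = Fraction(1, 2)   # the person is outside
P_PUDDLE = Fraction(1, 10)   # stepped in a leftover puddle, independent of rain


def bernoulli(p, flag):
    """Probability that a Bernoulli(p) variable takes the value `flag`."""
    return p if flag else 1 - p


def worlds():
    """Enumerate every state of the toy world with its exact probability."""
    for rain, outside, puddle in product((False, True), repeat=3):
        weight = (bernoulli(P_RAIN, rain)
                  * bernoulli(P_OUTSIDE, outside)
                  * bernoulli(P_PUDDLE, puddle))
        rain_caused = rain and outside      # the causal label "Raining -> ShoesWet"
        shoes_wet = rain_caused or puddle   # shoes can also get wet without rain
        yield weight, rain, shoes_wet, rain_caused


def conditional(event, given):
    """Exact P(event | given) over the enumerated worlds."""
    states = list(worlds())
    joint = sum(w for w, *s in states if given(*s) and event(*s))
    marginal = sum(w for w, *s in states if given(*s))
    return joint / marginal


# P(ShoesWet | Raining): every rainy case, whatever made the shoes wet.
p_observational = conditional(lambda rain, wet, caused: wet,
                              lambda rain, wet, caused: rain)

# P(ShoesWet | Raining -> ShoesWet): only cases where rain did the wetting.
p_causal = conditional(lambda rain, wet, caused: wet,
                       lambda rain, wet, caused: caused)

print(p_observational, p_causal)
```

Under these assumed numbers the observational value is 11/20, while conditioning on the causal label gives 1. The second number can only be computed because the simulation wrote the label down.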

Let me alter your example a bit: we have P(A|B), we want P(A|B,B->A). But given enough examples of the form P(A|B), a good algorithm can deduce B->A and use it going forward to predict A. How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases. LLMs do this with self-attention, by taking every pair of symbols in the context window and testing whether each pair is useful in determining the next token. As the attention matrix converges, the model can leverage the presence of "Raining & Outside" in predicting "ShoesWet".

Of course, this is a rather poor excuse for an explanation. The fact that "outside" and "raining" are close doesn't explain why "my shoes are wet". But it does get us closer to a genuine explanation in the sense that it eliminates a class of wrong possibilities from consideration: every sentence that doesn't have outside in proximity to raining downranks the generation "my shoes are wet". The model is further improved by adding more inductive relationships of this sort. For example, the presence of an expanded umbrella downranks ShoesWet, the presence of "stepped in puddle" upranks it. Construct about a billion of these kinds of inductive relationships, and you end up with something analogous to an explanatory model. The structural relationships encoded in the many attention matrices in modern LLMs in aggregate entail the explanatory relationships needed for causal modelling.
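As a rough illustration of the upranking and downranking described above, here is a minimal single-query attention sketch in plain Python. The key vectors, the value scores and the query are hand-picked assumptions; a trained model would learn these through its projection matrices, and nothing here is taken from a real LLM.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return weights, output


# Hypothetical context tokens with hand-picked key vectors.
tokens = ["raining", "outside", "the", "umbrella"]
keys = [[4.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-3.0, 0.5]]

# Hypothetical one-dimensional values: each token's push on a "ShoesWet" score.
values = [[1.0], [0.8], [0.0], [-1.0]]

# Hypothetical query for the position about to generate the next token.
query = [1.0, 0.0]

weights, output = attention(query, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:>8}: {weight:.3f}")
print("ShoesWet score:", round(output[0], 3))
```

With these made-up vectors most of the weight lands on "raining" and "outside", so the combined score is positive, while "umbrella" would pull it down if it were attended to more strongly. Whether such weights track causes or only co-occurrence is the question the thread is arguing about.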

> How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases.

But the machine doesn't know which are the right cases. We aren't presuming there's a column, Z = 1 for B -> A, and Z = 0 otherwise -- right?

The machine has no mechanism to distinguish these cases.

> testing whether each pair is useful in determining the next token

This isn't causation.

> every sentence that doesn't have outside in proximity to raining downranks the generation

So long as the sequential structure of sentences corresponds to the causal structure of the world: but that's kinda insane right?

We haven't rigged human language so that the distribution of tokens is the causal structure of the world. The reason text generated by LLMs appears meaningful is because we understand it. The actual structure of text generated isn't "via" a model of the world.

(Consider, for example, training an LLM on a dead untranslated language -- its output is incomprehensible, and its weights are arbitrarily correlated with anything we care to choose.)

Nevertheless, given our choice of token, you do have a model which says:

    P(ShoesWet|~Rain) < P(ShoesWet | Rain) < P(ShoesWet|Rain & Outside)
That's true. But we're choosing these additional conjunctions because we already know the causal model; these conjunctions are how we're eliminating confounders to get an approximation close to the actual.

(Which you'll never get, the actual value is `1`. Iff B -> A, then P(A|B->A) = 1 -- this is a deductive inference necessary for ordinary science to take place).
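The ordering of conditional probabilities can be read straight off co-occurrence counts. A small sketch, assuming a made-up corpus in which each "sentence" is reduced to a set of tokens; the counts are invented so that the inequality holds and carry no empirical weight.

```python
from fractions import Fraction

# Hypothetical corpus: (token set, number of sentences with exactly these tokens).
CORPUS = [
    (frozenset({"rain", "outside", "shoes_wet"}), 8),
    (frozenset({"rain", "outside"}), 1),
    (frozenset({"rain", "shoes_wet"}), 1),
    (frozenset({"rain"}), 5),
    (frozenset({"shoes_wet"}), 2),
    (frozenset({"outside"}), 8),
    (frozenset(), 10),
]


def p_token(target, present=(), absent=()):
    """Relative frequency of `target` among sentences matching the condition."""
    matching = [(tokens, n) for tokens, n in CORPUS
                if all(t in tokens for t in present)
                and not any(t in tokens for t in absent)]
    total = sum(n for _, n in matching)
    hits = sum(n for tokens, n in matching if target in tokens)
    return Fraction(hits, total)


p_no_rain = p_token("shoes_wet", absent=["rain"])
p_rain = p_token("shoes_wet", present=["rain"])
p_rain_outside = p_token("shoes_wet", present=["rain", "outside"])

print(p_no_rain, p_rain, p_rain_outside)
```

The counts reproduce the inequality (1/10 < 3/5 < 8/9 here), but nothing in the table says which conditioning token is a cause; that reading comes from whoever chose the tokens.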

In any case, P(A | B -> A) means without any confounders. To actually find the LLM's approximation of this we'd need to compute:

    P(A|B & C1 & C2 & C3 ...)  forall C_i..inf
And then find P(A|B & C') st. C' made P(A|B) maximally likely.

If you find a set of {C} st. P(A|B) has a high probability, you won't find causal conditions.

All that statistical association models here is, at best, salience -- not causal relevance.

>We haven't rigged human language so that the distribution of tokens is the causal structure of the world [...] The actual structure of text generated isn't "via" a model of the world.

This is an odd claim. I certainly say that I picked my cup off the floor rather than I picked my cup off the ceiling because gravity causes things to fall down rather than up. Human language isn't "rigged" to represent the causal structure of the world, but it does nonetheless. The distribution of tokens is such that the occurrence of (A,B) and (B,A) are asymmetric, and this is precisely because features of the world influence the distribution of words we use. A sufficiently strong model should be able to recover a model of this causal structure given enough training data.

>That's true. But we're choosing these additional conjunctions because we already know the causal model; these conjunctions are how we're eliminating confounders to get an approximation close to the actual.

But these patterns are represented in the training data by the words we use to discuss raining and wet shoes. There is every reason to think a strong model will recover this regularity.

>All that statistical association models here is, at best, salience -- not causal relevance.

That's all we can ever get from sensory experience. We infer causation because it is more explanatory than accepting a huge network of asymmetric correlations as brute. YeGoblynQueenne is right that my point is basically a version of the problem of induction. We can infer causation but we are never witness to causation. We do not build causal models, we build models of asymmetric correlations and infer causation from the success of our models. What a good statistical model does is not different in kind.

The problem of induction is fatal. But we overcome it: we do witness causation.

When I act on the world, with my body, I take as a given "Body -> Action". We witness causation in our every action.

> This is an odd claim

The tokens can be given any meaning. The statistical distribution of token frequencies in our languages is consistent with an infinite number of causal semantics.

We can find arbitrary patterns such that

    P(A) < P(A|B) < P(A|B & C) < P(A|B & C...)
We count only those we give a semantics to ("Rain" = Rain), and only those we already know are causal. This is the trick of humans reading the output of LLMs -- this is what makes it possible. It's essentially one big Eliza effect.

No, the structure of language isn't the structure of the world.

This pattern in tokens,

    P(A) < P(A|B) < P(A|B & C) < P(A|B & C...)

is an associative statistical model of conditional aggregate salience between token terms.

Phrase any such conditional probability you wish, it will never select for causal patterns.

This is why we experiment. It's why we act on the world to change it.

When the child burns their hand on the fireplace they do so once. Why?

Because the child immediately infers,

    P(TouchFire -> Pain | MoveHand -> Pain) = 1
How? via the abduction, roughly:

    P( TouchFire | Desire_TouchFire -> TouchFire) = 1
how?

    P( TouchFire | Desire_TouchFire -> MoveHand) = 1
how?

    P( Pain | MoveHand -> TouchFire -> Pain) = 1
etc.

In other words, we bottom out our reasoning in a

    P( BodilyMovement -> Effect | Desire -> BodilyMovement) = 1

Absent this, absent being in the world with a body, you cannot determine causes.

The problem of induction phrased in modern language is this: statistics isn't informative. Or, conditional probabilities are no route to knowledge. Or, AI is dumb.

Wow, that's a nice way to put it. I haven't seen that P(A|B -> A) notation before. Where does it come from?

But I think the OP is arguing, essentially, that P(A|B -> A) is only an interpretation of P(A|B) that we have chosen, among the many possible interpretations of P(A|B).

Which I think evokes the problem of induction. How do we know that P(A| B -> A) when all we can observe ourselves is P(A|B)?

> when all we can observe ourselves is P(A|B)?

No, we actually observe P(A | B -> A) where `B` is our body and `A` is some action we take on the world.

Hume was WRONG. Very wrong.

Statistical AI has the problem of induction; we have bodies, so we do not.

----

As for notation, I'm riffing off Judea Pearl's do notation.

He'd say, P(A|do(B))

but his `do` operator is slightly more general

Google: do-operator, causal analysis, judea pearl, etc.
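For anyone who wants to see the difference numerically, here is a small sketch of P(A|B) versus P(A|do(B)) using the standard backdoor adjustment. The three-variable model (a confounder Z that drives both B and A, with B having no effect on A at all) and its probabilities are assumptions chosen purely for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical model: Z -> B and Z -> A, with no arrow from B to A.
P_Z = {1: Fraction(1, 2), 0: Fraction(1, 2)}            # P(Z=z)
P_B_GIVEN_Z = {1: Fraction(4, 5), 0: Fraction(1, 5)}    # P(B=1 | Z=z)
P_A_GIVEN_Z = {1: Fraction(9, 10), 0: Fraction(1, 10)}  # P(A=1 | Z=z)


def joint(z, b, a):
    """Exact joint probability P(Z=z, B=b, A=a) under the assumed model."""
    p_b = P_B_GIVEN_Z[z] if b else 1 - P_B_GIVEN_Z[z]
    p_a = P_A_GIVEN_Z[z] if a else 1 - P_A_GIVEN_Z[z]
    return P_Z[z] * p_b * p_a


def prob(predicate):
    """Total probability of all states (z, b, a) satisfying the predicate."""
    return sum(joint(z, b, a)
               for z, b, a in product((0, 1), repeat=3)
               if predicate(z, b, a))


def observational():
    """P(A=1 | B=1): plain conditioning on the observed data."""
    return prob(lambda z, b, a: a and b) / prob(lambda z, b, a: b)


def interventional():
    """P(A=1 | do(B=1)) via backdoor adjustment: sum_z P(A=1 | B=1, z) P(z)."""
    total = Fraction(0)
    for z0 in (0, 1):
        p_a_given_b_z = (prob(lambda z, b, a: a and b and z == z0)
                         / prob(lambda z, b, a: b and z == z0))
        total += p_a_given_b_z * prob(lambda z, b, a: z == z0)
    return total


print(observational(), interventional())
```

With these assumed numbers P(A|B) comes out at 37/50 while P(A|do(B)) is 1/2, which equals P(A): forcing B does nothing to A, even though seeing B raises the probability of A. Note that the adjustment needs to know that Z is the confounder, which is causal knowledge supplied from outside the data.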

Ah, I thought it might be something to do with Judea Pearl.

>> Hume was WRONG. Very wrong.

Oh boy :)

I can see what you're saying about having bodies, but bodies are very limited things and that's just making Hume's point. We can only know so much by experiencing it with our bodies. We've learned a lot more about the world, and its foundations, thanks to our ability to draw inferences without having to engage our bodies. For example, all of mathematics, including logic that studies inference, is "things we do without having to engage our bodies". And those very things have shown us the limits of our own abilities, or at least our ability to create formal systems that can describe the world in its entirety. They have shown us the limits of our ability for inductive inference (and in a very concrete manner - see E. Mark Gold's Language Identification in the Limit).

Machine learning systems are more limited than ourselves, that's right. And that's because we have created them, and we are limited beings that cannot know the entirety of the world just by looking at it, or reasoning about it.

One of the premises of Hume's sceptical metaphysics was that

    P(A|B) is just P(A | B -> A) 
The argument for this was `A` and `B` are only "Ideas in the head" and don't refer to a world. And secondly, by assertion, that Ideas are "thin" pictorial phenomena that can only be sequenced.

Hume here is just wrong. Our terms refer: `A` succeeds in referring to, eg., Rain. And our experiences aren't "thin", they're "thick" -- this was Kant's point. Our experiences play a rich role in inference that cannot be played by "pictures".

To have a mental representation R of the world is to have a richly structured interpretation which does, in fact, contain and express causation.

i.e., R can quite easily be a mental representation of "B -> A". This, after all, is what we are thinking when we think about the rain hitting our shoes. We do not imagine P(A|B), we imagine P(A|B->A) -- if we didn't, we couldn't reason about the future.

The question is only how we obtain such representations, and the answer is: the body with its intrinsic known causal structure.

Whenever we need to go beyond the body, we invent tools to do so -- and connect the causal properties of those tools to our body.

Hume here is wrong in every respect. And it's his extreme scepticism which undergirds all those who would say modern AI is a model of intelligence -- or is capable of modelling the world.

The world isn't a "constant conjunction of text tokens" -- even Hume wouldn't be this insane. Nevertheless, it is this lobotomised Hume we're dealing with.

There is a science now for how the mind comes to represent the world -- we do not need 18th C. crazy ideas. Insofar as they are presented as science, they're pseudoscience.

Thank you for sharing your opinion on Hume, but I don't see how e.g. Polyominoes, to take a random mathematical (ish) concept I was thinking about today, are connected to our body. I can think of many more examples. Geometry, trigonometry, algebra, calculus, the first order predicate calculus, etc. None of those seem to be connected to my body in any way.

Anyway this all is why I'm happy I'm not a philosopher. Philosophers deal in logic, but they don't have a machine that can calculate in logic, and keep them on the straight and narrow with its limited resources. A philosopher can say anything and imagine anything. A computer scientist - well, she can, but good luck making that happen on a computer.

I very much disagree but have an upvote for a well-argued comment.

>> The question is, does an LLM in the course of modelling asymmetric correlations, develop something analogous to an explanatory model. I think so, in the sense that a good statistical model will intrinsically capture explanatory relations.

A statistical model may "capture" explanatory relations, but can it use them? A data scientist showing a plot to a bunch of people is explaining something using a statistical model, so obviously the statistical model has some explanatory power. But it's the data scientist that is using the model as an explanation. I think the discussion is whether a statistical model can exist that doesn't just "capture" an explanation, but can also make use of that explanation like a human would, for example as background knowledge to build new explanations. That seems very far fetched: a statistical model that doesn't just model, but also introspects and has agency.

Anyway I find it very hard to think of language models as explanatory models. They're predictive models, they are black boxes, they model language, but what do they explain? And to whom? The big debate is that (allegedly) "we don't understand language models" in the first place. We have a giant corpus of incomprehensible data; we train a giant black box model on it; now we have a giant incomprehensible model of the data. What did we explain?

>> But this is in the solution space of good models of statistical regularity of an external system. To maximally predict the next token in a sequence just requires a model of the process that generates that sequence.

Let's call that model M* for clarity. The search space of models, let's call it S. There are any number of models in S that can generate many of the same sequences as M* without being M*. The question is, and has always been, in machine learning, how do we find M* in S, without being distracted by M_1, M_2, M_3, ... that are not M*?
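A tiny sketch of the problem, using a hypothetical pair of models over integer sequences: both reproduce the four observed terms exactly, and only an unseen fifth term separates them. The names `m_star` and `m_1` are illustrative labels, and which one is "true" is simply assumed.

```python
from math import comb


def m_star(n):
    """Assumed 'true' generating process: doubling."""
    return 2 ** n


def m_1(n):
    """A rival that fits the same observations: a cubic polynomial in n."""
    return comb(n, 0) + comb(n, 1) + comb(n, 2) + comb(n, 3)


# The training data: the first four terms, 1, 2, 4, 8.
observed = [m_star(n) for n in range(4)]

# On the observed data the two models are indistinguishable.
assert observed == [m_1(n) for n in range(4)]

# They only come apart on a term that was never observed.
print(m_star(4), m_1(4))
```

The two predictions for the fifth term are 16 and 15. No amount of staring at the first four terms tells you which model you have got hold of.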

Given that we have a very limited way to test the capabilities of models, and that models are getting bigger and bigger (in machine learning anyway) which makes it harder and harder to get a good idea of what, exactly, they are modelling, how can we say which model we got a hold of?

>A statistical model may "capture" explanatory relations, but can it use them?

That's the beauty of autoregressive training, the model is rewarded for capturing and utilizing explanatory relations because they have an outsized effect on prediction. It's the difference between frequency counting while taking the past context as an opaque unit vs decomposing the past context and leveraging relevant tokens for generation while ignoring irrelevant ones. Self-attention does this by searching over all pairs of tokens in the context window for relevant associations. Induction heads[1] are a fully worked out example of this and help explain in-context learning in LLMs.
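At the level of behaviour, the induction-head pattern is usually summarised as: find an earlier occurrence of the current token and predict whatever followed it. The function below is a plain-Python caricature of that behaviour, assuming exact token matches; it is not the attention circuit itself, and `induction_predict` is a name made up for this sketch.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token; None if there is none."""
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the last one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


context = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
print(induction_predict(context))
```

Here the second "Mr" is completed with "Dursley" purely from the context, with no prior knowledge of the name. That is the sense in which the mechanism decomposes the past context instead of treating it as an opaque unit.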

>Anyway I find it very hard to think of language models as explanatory models. They're predictive models, they are black boxes, they model language, but what do they explain? And to whom?

The model encodes explanatory relationships of phenomena in the world and it uses these relationships to successfully generalize its generation out-of-distribution. Basically, these models genuinely understand some things about the world. LLMs exhibit linguistic competence as they engage with subject matter, responding accurately to unseen variations in prompts on that subject matter. At least in some cases. I argue this point in some detail here[2].

>how can we say which model we got a hold of?

More sophisticated tests, ideally that can isolate exactly what was in the training data in comparison to what was generated. I think the example of the wide variety of poetry these models generate should strongly raise one's credence that they capture a sufficiently accurate model of poetry. I go into detail on this example in the link I mentioned. Aside from that, various ways of testing in-context learning can do a lot of work here[3].

[1] https://transformer-circuits.pub/2022/in-context-learning-an...

[2] https://www.reddit.com/r/naturalism/comments/1236vzf/

[3] https://twitter.com/leopoldasch/status/1638848881558704129

>> That's the beauty of autoregressive training, the model is rewarded for capturing and utilizing explanatory relations because they have an outsized effect on prediction.

That sentence should be decorated with the word "allegedly", or perhaps "conjecture"! In practical terms, I believe you are pointing out that language models of the GPT family are trained on a context surrounding, not just preceding, a predicted token. That's right (and it gets fudged in discussions about predicting the next token in a sequence), but we could already do that with skip-gram models, and with context-sensitive grammars, and dependency grammars, many years ago, and I don't remember anyone saying those were specially capable of capturing explanatory relations [1]. Although for grammars the claim could be made, since they are generally based on explanatory models of human language (but not because of context-sensitivity).

Anyway, I thought you were arguing that explanations are arbitrary, "explanatory posits", and wouldn't that mean that an explanation doesn't improve prediction? This is not to catch you in contradiction, I'm genuinely unsure about this myself. My understanding is that explanatory hypotheses improve predictions in the long run [2], but that's not to say that a predictive model will improve given explanations, rather explanatory models eventually replace strictly predictive models.

Are you saying that including explanations in training data can improve prediction? That would make sense, but this is very hard to do when training a predictive model on text. In that case, the explanations are at best hidden variables and language models are just not the right kind of model to model hidden variables.

Sorry, writing too much today. And I got work to do. So I won't bitch about "in-context learning" (what we used to call sampling from a model back in the day, three years ago before the GPT-3 paper :).

______________

[1] My Master's thesis was a bunch of language models trained on Howard Phillips Lovecraft's complete works, and separately on a corpus of Magic: the Gathering cards. One of those models was a probabilistic Context-Free Grammar, and despite its context-freedom, and because it was a Definite Clause Grammar, I could sample from it with input strings like "The X in the darkness with the Y in the Z of the S" and it would dutifully fill-in the blanks with tokens that maximised the probability of the sentence. So even my puny PCFG could represent bi-directional context, after a fashion. Yet I wouldn't ever accuse it of being explanatory. Although I would say it was quite mad, given the corpus.

[2] I mention in another comment my favourite example of the theory of epicycles compared to Kepler's laws of planetary motion.

>Anyway, I thought you were arguing that explanations are arbitrary, "explanatory posits", and wouldn't that mean that an explanation doesn't improve prediction?

I don't mean to say that explanations are arbitrary, rather that causes are not observed only inferred. But we infer causes because of the explanatory work they do. This isn't arbitrary, it is strongly constrained by predictive value as well as, I'm not sure what to call it, epistemic coherence and intelligibility maybe? Explanatory models are satisfying because they allow us to derive many phenomena from fewer assumptions. Good explanatory models are mutually reinforcing and have a high level of coherence among assumptions ("epistemic coherence"). They also require the fewest number of assumptions taken as brute without further justification ("intelligibility").

Why think explanatory models are better at prediction? Because the mutual coherence among assumptions and explanatory power of the whole (ability to predict much from few assumptions) suggests the explanatory model is getting at the productive features of the phenomena that result in the observed behavior. Essentially, the fewer number of posits, the fewer ways to "bake in" the data into the model. If we were to cast this as a computational problem, i.e. find a program that reproduces the data, shorter programs are necessarily more explanatory. There's no other way to explain the coincidence of program picked out of a small space generating data picked out of a very large space without there being an explanatory relation between the two. Further, our credence for explanation increases as the ratio of the respective spaces diverge.
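A crude way to see the point about program length, taking Python source length as a stand-in for description length (an assumption; a proper treatment would use Kolmogorov complexity or a minimum description length code). Both "programs" below reproduce the same data, but only one of them had room to bake the data in.

```python
# The data to be reproduced: the first 1000 square numbers.
data = [n * n for n in range(1000)]

# Program 1: a lookup table that simply spells the data out.
table_program = repr(data)

# Program 2: a short rule that generates the data.
rule_program = "[n * n for n in range(1000)]"

# Both programs reproduce the data exactly when evaluated.
assert eval(table_program) == data
assert eval(rule_program) == data

# Compare their lengths in characters.
print(len(table_program), len(rule_program))
```

The table grows with the data while the rule stays the same size, so the longer the sequence it keeps reproducing, the harder it is to put the rule's success down to coincidence.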

This is really the problem of machine learning in a nutshell. Is the data vs parameter count over some threshold such that training is biased towards explanatory relations? Is the model biased in the right way to discover these relations faster than it can memorize the data? LLMs seem to have crossed this threshold because of the massive amount of data they are trained on, seemingly much larger than can comfortably be memorized, and the inductive biases of Transformers that search the space of models to extract explanatory relations.

>Are you saying that including explanations in training data can improve prediction? That would make sense, but this is very hard to do when training a predictive model on text. In that case, the explanations are at best hidden variables and language models are just not the right kind of model to model hidden variables.

I agree with this, and I think these explanatory relations are implicit in human text. I gave the example in another comment that I say things like "I picked my cup off the floor" rather than "I picked my cup off the ceiling" because causal relations in the real world influence the text we write. The relation of "things fall down" is widely explanatory. But it seems to me that LLMs are very much general modelers of hidden variables, given the wide applicability of LLMs in areas that aren't strictly related to natural language. But then again, any structured data is a language in a broad sense. And the grammar can be arbitrarily complex and so can encode deep relationships among data in any domain. Personally, I'm not so surprised that a "language model" has such wide applicability.

>> Why think explanatory models are better at prediction? Because the mutual coherence among assumptions and explanatory power of the whole (ability to predict much from few assumptions) suggests the explanatory model is getting at the productive features of the phenomena that result in the observed behavior. Essentially, the fewer number of posits, the fewer ways to "bake in" the data into the model. If we were to cast this as a computational problem, i.e. find a program that reproduces the data, shorter programs are necessarily more explanatory. There's no other way to explain the coincidence of program picked out of a small space generating data picked out of a very large space without there being an explanatory relation between the two. Further, our credence for explanation increases as the ratio of the respective spaces diverge.

Like you say, that's the problem of machine learning. There's a huge space of hypotheses, many of which can fit the data, but how do we choose one that also fits unseen data? Explanatory models are easier to trust, and easier to trust to generalise, because we can "see" why they would.

But the problem with LLMs is that they remain black boxes. If those black boxes are explanatory models, then to whom is the explanation explained? Who is there to look at the explanation, and trust the predictions? This is what I can't see and I think it turns into a "turtles all the way down" kind of situation. Unless there is a human mind, somewhere in the process, that can look at the explanatory model and use the explanation to explain some observation, then I don't see how the model can really be said to be explanatory. Explanatory to whom?

>> But it seems to me that LLMs are very much general modelers of hidden variables, given the wide applicability of LLMs in areas that aren't strictly related to natural language.

Well, I don't know. Maybe we'll find that's the case. For the time being I'm trying to keep an open mind, despite all the noise.