Hacker News new | ask | show | jobs
by mjburgess 1167 days ago
Well this is a good reply, but it's mistaken.

The conditional probability:

    P(x[0]| x[-1], x[-2], x[-3] ...)
is not the same as,

    P(x[0] | x[-1], x[-2], ... -> x[0])
Where `->` says we select only those cases where x[-1],... brought-about x[0].

To see why this is the case, suppose we do have a god's eye-view of all of spacetime.

    P(A|B) 
    always selects for all instances where B follows A.

    P(A| B -> A) 
    selects only those instances where B's following A was caused by A.
Eg.,

    P(ShoesWet | Raining) 

    is very different from 

    P(ShoesWet | Raining -> ShoesWet)
in the former case the two events have, in general, nothing to do with each other.

To select "Raining -> ShoesWet" even with a gods-eye-view we need more than statistics... since those events which count as "Rain -> ShoeWet" have to be selected on a non-statistical basis.

For the athelete catching a ball, or the scientist designing the experiment, we're interested only in those causal cases.

For sure P(A|B) is a (approximate, statistical) model of P(A| B->A) -- but it's a very restricted, limited model.

The athlete needs to estimate P(ball-stops | catch -> ball-stops)

NOT P(ball-stops | catch) which is just any case of the ball-stopping given any case of catching.

2 comments

Let me alter your example a bit: we have P(A|B), we want P(A|B,B->A). But given enough examples of the form P(A|B), a good algorithm can deduce B->A and use it going forward to predict A. How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases. LLMs do this with self-attention, by taking every pair of symbols in the context window and testing whether each pair is useful in determining the next token. As the attention matrix converges, the model can leverage the presence of "Raining & Outside" in predicting "ShoesWet".

Of course, this is a rather poor excuse for an explanation. The fact that "outside" and "raining" are close doesn't explain why "my shoes are wet". But it does get us closer to a genuine explanation in the sense that it eliminates a class of wrong possibilities from consideration: every sentence that doesn't have outside in proximity to raining downranks the generation "my shoes are wet". The model is further improved by adding more inductive relationships of this sort. For example, the presence of an expanded umbrella downranks ShoesWet, the presence of "stepped in puddle" upranks it. Construct about a billion of these kinds of inductive relationships, and you end up with something analogous to an explanatory model. The structural relationships encoded in the many attention matrices in modern LLMs in aggregate entail the explanatory relationships needed for causal modelling.

> How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases.

But the machine doesn't know which are the right cases. We aren't presuming there's a column, Z = 1 for B -> A, and Z = 0 otherwise -- right?

The machine has no mechanism to distinguish these cases.

> testing whether each pair is useful in determining the next token

This isnt causation.

> every sentence that doesn't have outside in proximity to raining downranks the generation

So long as the sequential structure of sentences corresponds to the causal structure of the world: but that's kinda insane right?

We haven't rigged human language so that the distribution of tokens is the causal structure of the world. The reason text generated by LLMs appears meaningful is because we understand it. The actual structure of text generated isnt "via" a model of the world.

(Consider, for example, training an LLM on a dead untranslated language -- it's output is incomprehensible, and its weights are abitarily correlated with anything we care to choose.)

Nevertheless, given our choice of token, you do have a model which says:

    P(ShoesWet|~Rain) < P(ShoesWet | Rain) < P(ShoesWet|Rain & Outside)
That's true. But we're choosing these additional conjunctions because we already know the causal model; these conjunctions are how we're eliminating confounders to get an approximation close to the actual.

(Which you'll never get, the actual value is `1`. Iff A -> B, then P(A|B->A) = 1 -- this is a deductive inference necessary for ordinary science to take place).

In any case, P(A | B -> A) means without any confounders. To actually find the LLM's approximation of this we'd need to compute:

    P(A|B & C1 & C2 & C3 ...)  forall C_i..inf
And then find P(A|B & C') st. C' made P(A|B) maximally likely.

If you find a set of {C} st. P(A|B) has a high probability, you won't find causal conditions.

All that statistical association models here is, at best, salience -- not causal relevance.

>We haven't rigged human language so that the distribution of tokens is the causal structure of the world [...] The actual structure of text generated isnt "via" a model of the world.

This is an odd claim. I certainly say that I picked my cup off the floor rather than I picked my cup off the ceiling because gravity causes things to fall down rather than up. Human language isn't "rigged" to represent the causal structure of the world, but it does nonetheless. The distribution of tokens is such that the occurrence of (A,B) and (B,A) are asymmetric, and this is precisely because of features of the world influence the distribution of words we use. A sufficiently strong model should be able to recover a model of this causal structure given enough training data.

>That's true. But we're choosing these additional conjunctions because we already know the causal model; these conjunctions are how we're eliminating confounders to get an approximation close to the actual.

But these patterns are represented in the training data by the words we use to discuss raining and wet shoes. There is every reason to think a strong model will recover this regularity.

>All that statistical association models here is, at best, salience -- not causal relevance.

That's all we can ever get from sensory experience. We infer causation because it is more explanatory than accepting a huge network of asymmetric correlations as brute. YeGoblynQueenne is right that my point is basically a version of the problem of induction. We can infer causation but we are never witness to causation. We do not build causal models, we build models of asymmetric correlations and infer causation from the success of our models. What a good statistical model does is not different in kind.

The problem of induction is fatal. But we overcome it: we do witness causation.

When I act on the world, with my body, I take as a given "Body -> Action". We witness causation in our every action.

> This is an odd claim

The tokens can be given any meaning. The statistical distribution of token frequencies in our languages have an infinite number of causal semantics which are consistent with them.

We can find arbitary patterns such that

    P(A) < P(A|B) < P(A|B & C) < P(A|B & C...)
Only those we give a semantics to ("Rain" = Rain), and only those we already know are causal we will count. This is the trick of humans reading the output of LLMs -- this is what makes it possible. It's essentially one big Eliza effect.

No, the structure of language isnt the structure of the world.

This pattern in tokens,

    P(A) < P(A|B) < P(A|B & C) < P(A|B & C...)

Is an associative statistical model of conditional aggregate salience between token terms.

Phrase any such conditional probability you wish, it will never select for causal patterns.

this is why we experiment. It's why we act on the world to change it.

When the child burns their hand on the fireplace they do so once. Why?

Because the child immediately infers,

    P(TouchFire -> Pain | MoveHand -> Pain) = 1
How? via the abduction, roughly:

    P( TouchFire | Desire_TouchFire -> TouchFire) = 1
how?

    P( TouchFire | Desire_TouchFire -> MoveHand) = 1
how?

    P( Pain | MoveHand -> TouchFire -> Pain) = 1
etc.

In other words, we bottom out our reasoning in a

    P( BodilyMovement -> Effect | Desire -> BodilyMovement) = 1

Absent this, absent being in the world with a body, you cannot determine causes.

The problem of induction phrased in modern language is this: statistics isn't informative. Or, conditional probabilities are no route to knowledge. Or, AI is dumb.

Wow, that's a nice way to put it. I haven't seen that P(A|B -> A) notation before. Where does it come from?

But I think the OP is arguing, essentially, that P(A|B -> A) is only an interpretation of P(A|B) that we have chosen, among the many possible interpretations of P(A|B).

Which I think evokes the problem of induction. How do we know that P(A| B -> A) when all we can observe ourselves is P(A|B)?

> when all we can observe ourselves is P(A|B)?

No, we actually observe P(A | B -> A) where `B` is our body and `A` is some action we take on the world.

Hume was WRONG. Very wrong.

Statistical AI has the problem of induction; we have bodies, so we do not.

----

As for notation, I'm riffing of Judea Pearl's do notation.

He'd say, P(A|do(B))

but his `do` operator is slightly more general

Google: do-operator, causal analysis, judea pearl, etc.

Ah, I thought it might be something to do with Judea Pearl.

>> Hume was WRONG. Very wrong.

Oh boy :)

I can see what you're saying about having bodies, but bodies are very limited things and that's just making Hume's point. We can only know so much by experiencing it with our bodies. We've learned a lot more about the world, and its foundations, thanks to our ability to draw inferences without having to engage our bodies. For example, all of mathematics, including logic that studies inference, is "things we do without having to engage our bodies". And those very things have shown us the limits of our own abilities, or at least our ability to create formal systems that can describe the world in its entirety. They have shown us the limits of our ability for inductive inference (and in a very concrete manner - see Mark E. Gold's Language Identification in the Limit).

Machine learning systems are more limited than ourselves, that's right. And that's because we have created them, and we are limited beings that cannot know the entirety of the world just by looking at it, or reasoning about it.

One of the premises of hume's sceptical metaphysics was that

    P(A|B) is just P(A | B -> A) 
The argument for this was `A` and `B` are only "Ideas in the head" and don't refer to a world. And secondly, by assertion, that Ideas are "thin" pictorial phenomena that can only be sequenced.

Hume here is just wrong. Our terms refer: `A` succeeds in referring to, eg., Rain. And our experiences aren't "thin", they're "thick" -- this was Kant's point. Our experiences play a rich role in inference that cannot be played by "pictures".

To have a metal representation R of the world is to have a richly structured interpretation which does, in fact, contain and express causation.

ie., R can quite easily be a mental representation of "B -> A". This, after all, is what we are thinking when we think about the rain hitting our shoes. We do not imagine P(A|B), we imagine P(A|B->A) -- if we didnt, we couldn't reason about the future.

The question is only how we obtain such representations, and the answer is: the body with its intrinsic known causal structure.

Whenever we need to go beyond the body, we invent tools to do so -- and connect the causal properties of those tools to our body.

Hume here is wrong in every respect. And it's his extreme scepticism which undergirds all those who would say modern AI is a model of intelligence -- or is capable of modelling the world.

The word isnt a "constant conjunction of text tokens" -- even Hume wouldnt be this insane. Nevertheless, it is this lobotomised Hume we're dealing with.

There is a science now for how the mind comes to represent the world -- we do not need 18th C. crazy ideas. Insofar as they are presented as science, theyre pseudoscience

Thank you for sharing your opinion on Hume, but I don't see how e.g. Polyominoes, to take a random mathematical (ish) concept I was thinking about today, are connected to our body. I can think of many more examples. Geometry, trigonometry, algebra, calculus, the first order predicate calculus, etc. None of those seem to be connected to my body in any way.

Anyway this all is why I'm happy I'm not a philosopher. Philosophers deal in logic, but they don't have a machine that can calculate in logic, and keep them in the straight and narrow with its limited resources. A philosopher can say anything and imagine anything. A computer scientist -well, she can, but good luck making that happen on a computer.

Well Kant (Chomsky, et al.) are probably right that we must have innate concepts -- esp. causation, geometry, linguistic primitives etc. in order to be able to perceive at all.

So in this sense a minimal set of a-priori concepts are required to be built-in, or else we couldn't learn anything at all.

You might say that this means we can separate the sensory-motor genesis of concepts from their content -- but I think this only applies to a-priori ones.

What i'm talking about is conceptualisations of our environment that provide its causal structure. One important aspect of that is how desires (via goals) change the world. Another is how the world itself works.

Both of these do require a body, or at least a solution to the problem of induction (ie., that P(A|B) is consistent with P(A|B->A), P(A|~(B->A_), P(A| B->Z, C->Z, Z->A), etc.)