| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by fenomas 1202 days ago
	Respectfully, that sounds like hand-waving. Claiming to know where concepts do and don't come from just leads to questions like "did the natural numbers exist before we did?", which are centuries old and presumably not resolvable. Whereas a more focused question like "can an AI produce outputs that are novel to someone familiar with all of the AI's inputs?" seems resolvable, and even if one thinks it's unlikely or not easy, it's very hard to buy the idea that it's impossible.

1 comments

mjburgess 1202 days ago

> just leads to questions

No, not really. People in this area are severely poorly informed on animal learning, and "ordinary science".

AI evangelists like to treat as "merely philosophical matters" profoundly scientific ones.

The issues here belong to ordinary science. Can a machine with access only to statistical patterns in the distribution of text tokens infer the physical structure of reality?

We can say, as certain as anything: No.

Associative statistical models are not phenomenological models (ie., specialised to observable cause-effect measures); and phenomenological models are not causal (ie., do not give the mechanism of the cause-effect relationship).

Further, we know as surely as an athlete catching a ball, that animals develop causal models of their environments "deeply and spontaneously".

And we know, to quite a robust degree, how they do so -- using interior causal models of their bodies to change their environments by intentional acts can confirm or disconfirm environmental models. This is modelled logically as abduction, causally as sensory-motor adaption, and so on.

This is not a philosophical matter. We know that "statistical learning" which is nothing more than a "correlation maximisation objective" over non-phenomenological, non-causal, non-physical data produces approximate associative models of those target domains -- that have little use beyond "replaying those associations".

ChatGPT appears to do many things. But you will see soon, after a year or two of papers published, that those things were tricks. That "replaying associations in everything ever written" is a great trick, that is very useful to people.

Today you can ask ChatGPT to rewrite harry potter "if harry were evil" or some such thing. That's because there are many libraries of books on harry potter and "evil" -- and by statistical interpolation alone, you can answer an apparent counter-factual question which should require imagination.

But give ChatGPT an actual counter-factual whose parts are only in the question, and you'll be out-of-luck.

Eg., tell it about tables, chairs, pens, cups and ask it to arrange them using given operations so that, eg., the room is orderly. Or whatever you wish.

Specified precisely enough you can expose the trick.

hackinthebochs 1202 days ago

>This is not a philosophical matter. We know that "statistical learning" which is nothing more than a "correlation maximisation objective" over non-phenomenological, non-causal, non-physical data produces approximate associative models of those target domains -- that have little use beyond "replaying those associations".

Why do you think the data LLMs are trained on are non-causal? Lets take causation as asymmetric correlation. That is, (A,B) present in the training data does not imply (B,A) presence. But of course human text is asymmetric in this manner and LLMs will pick up on this asymmetry. You might say that causation isn't merely about asymmetric correlation, but that of the former determining the latter. But this isn't something we observe from nature, it is an explanatory posit that humans have landed on in service to modelling the world. So causation is intrinsically explanatory, and explanation is intrinsically causal. The question is, does an LLM in the course of modelling asymmetric correlations, develop something analogous to an explanatory model. I think so, in the sense that a good statistical model will intrinsically capture explanatory relations.

Cashing out explanation and explanatory model isn't easy. But as a first pass I can say that explanatory models capture intrinsic regularity of a target system such that the model has an analogical relationship with internal mechanisms in the target system. This means that certain transformations applied to the target system has a corresponding transformation in the model that identifies the same outcome. If we view phenomena in terms of mechanistic levels with the extrinsic observable properties as the top level and the internal mechanisms as lower levels, an explanatory model will model some lower mechanistic level and recover properties of the top level.

But this is in the solution space of good models of statistical regularity of an external system. To maximally predict the next token in a sequence just requires a model of the process that generates that sequence.

mjburgess 1202 days ago

Well this is a good reply, but it's mistaken.

The conditional probability:

    P(x[0]| x[-1], x[-2], x[-3] ...)

is not the same as,

    P(x[0] | x[-1], x[-2], ... -> x[0])

Where `->` says we select only those cases where x[-1],... brought-about x[0].

To see why this is the case, suppose we do have a god's eye-view of all of spacetime.

    P(A|B) 
    always selects for all instances where B follows A.

    P(A| B -> A) 
    selects only those instances where B's following A was caused by A.

Eg.,

    P(ShoesWet | Raining) 

    is very different from 

    P(ShoesWet | Raining -> ShoesWet)

in the former case the two events have, in general, nothing to do with each other.

To select "Raining -> ShoesWet" even with a gods-eye-view we need more than statistics... since those events which count as "Rain -> ShoeWet" have to be selected on a non-statistical basis.

For the athelete catching a ball, or the scientist designing the experiment, we're interested only in those causal cases.

For sure P(A|B) is a (approximate, statistical) model of P(A| B->A) -- but it's a very restricted, limited model.

The athlete needs to estimate P(ball-stops | catch -> ball-stops)

NOT P(ball-stops | catch) which is just any case of the ball-stopping given any case of catching.

hackinthebochs 1202 days ago

Let me alter your example a bit: we have P(A|B), we want P(A|B,B->A). But given enough examples of the form P(A|B), a good algorithm can deduce B->A and use it going forward to predict A. How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases. LLMs do this with self-attention, by taking every pair of symbols in the context window and testing whether each pair is useful in determining the next token. As the attention matrix converges, the model can leverage the presence of "Raining & Outside" in predicting "ShoesWet".

Of course, this is a rather poor excuse for an explanation. The fact that "outside" and "raining" are close doesn't explain why "my shoes are wet". But it does get us closer to a genuine explanation in the sense that it eliminates a class of wrong possibilities from consideration: every sentence that doesn't have outside in proximity to raining downranks the generation "my shoes are wet". The model is further improved by adding more inductive relationships of this sort. For example, the presence of an expanded umbrella downranks ShoesWet, the presence of "stepped in puddle" upranks it. Construct about a billion of these kinds of inductive relationships, and you end up with something analogous to an explanatory model. The structural relationships encoded in the many attention matrices in modern LLMs in aggregate entail the explanatory relationships needed for causal modelling.

mjburgess 1202 days ago

> How? By searching over the space of explanatory models to find the model that helps to predict P(A|B) in the right cases and not in the wrong cases.

But the machine doesn't know which are the right cases. We aren't presuming there's a column, Z = 1 for B -> A, and Z = 0 otherwise -- right?

The machine has no mechanism to distinguish these cases.

> testing whether each pair is useful in determining the next token

This isnt causation.

> every sentence that doesn't have outside in proximity to raining downranks the generation

So long as the sequential structure of sentences corresponds to the causal structure of the world: but that's kinda insane right?

We haven't rigged human language so that the distribution of tokens is the causal structure of the world. The reason text generated by LLMs appears meaningful is because we understand it. The actual structure of text generated isnt "via" a model of the world.

(Consider, for example, training an LLM on a dead untranslated language -- it's output is incomprehensible, and its weights are abitarily correlated with anything we care to choose.)

Nevertheless, given our choice of token, you do have a model which says:

    P(ShoesWet|~Rain) < P(ShoesWet | Rain) < P(ShoesWet|Rain & Outside)

That's true. But we're choosing these additional conjunctions because we already know the causal model; these conjunctions are how we're eliminating confounders to get an approximation close to the actual.

(Which you'll never get, the actual value is `1`. Iff A -> B, then P(A|B->A) = 1 -- this is a deductive inference necessary for ordinary science to take place).

In any case, P(A | B -> A) means without any confounders. To actually find the LLM's approximation of this we'd need to compute:

    P(A|B & C1 & C2 & C3 ...)  forall C_i..inf

And then find P(A|B & C') st. C' made P(A|B) maximally likely.

If you find a set of {C} st. P(A|B) has a high probability, you won't find causal conditions.

All that statistical association models here is, at best, salience -- not causal relevance.

hackinthebochs 1202 days ago

>We haven't rigged human language so that the distribution of tokens is the causal structure of the world [...] The actual structure of text generated isnt "via" a model of the world.

This is an odd claim. I certainly say that I picked my cup off the floor rather than I picked my cup off the ceiling because gravity causes things to fall down rather than up. Human language isn't "rigged" to represent the causal structure of the world, but it does nonetheless. The distribution of tokens is such that the occurrence of (A,B) and (B,A) are asymmetric, and this is precisely because of features of the world influence the distribution of words we use. A sufficiently strong model should be able to recover a model of this causal structure given enough training data.

>That's true. But we're choosing these additional conjunctions because we already know the causal model; these conjunctions are how we're eliminating confounders to get an approximation close to the actual.

But these patterns are represented in the training data by the words we use to discuss raining and wet shoes. There is every reason to think a strong model will recover this regularity.

>All that statistical association models here is, at best, salience -- not causal relevance.

That's all we can ever get from sensory experience. We infer causation because it is more explanatory than accepting a huge network of asymmetric correlations as brute. YeGoblynQueenne is right that my point is basically a version of the problem of induction. We can infer causation but we are never witness to causation. We do not build causal models, we build models of asymmetric correlations and infer causation from the success of our models. What a good statistical model does is not different in kind.

YeGoblynQueenne 1202 days ago

Wow, that's a nice way to put it. I haven't seen that P(A|B -> A) notation before. Where does it come from?

But I think the OP is arguing, essentially, that P(A|B -> A) is only an interpretation of P(A|B) that we have chosen, among the many possible interpretations of P(A|B).

Which I think evokes the problem of induction. How do we know that P(A| B -> A) when all we can observe ourselves is P(A|B)?

mjburgess 1202 days ago

> when all we can observe ourselves is P(A|B)?

No, we actually observe P(A | B -> A) where `B` is our body and `A` is some action we take on the world.

Hume was WRONG. Very wrong.

Statistical AI has the problem of induction; we have bodies, so we do not.

----

As for notation, I'm riffing of Judea Pearl's do notation.

He'd say, P(A|do(B))

but his `do` operator is slightly more general

Google: do-operator, causal analysis, judea pearl, etc.

YeGoblynQueenne 1201 days ago

Ah, I thought it might be something to do with Judea Pearl.

>> Hume was WRONG. Very wrong.

Oh boy :)

I can see what you're saying about having bodies, but bodies are very limited things and that's just making Hume's point. We can only know so much by experiencing it with our bodies. We've learned a lot more about the world, and its foundations, thanks to our ability to draw inferences without having to engage our bodies. For example, all of mathematics, including logic that studies inference, is "things we do without having to engage our bodies". And those very things have shown us the limits of our own abilities, or at least our ability to create formal systems that can describe the world in its entirety. They have shown us the limits of our ability for inductive inference (and in a very concrete manner - see Mark E. Gold's Language Identification in the Limit).

Machine learning systems are more limited than ourselves, that's right. And that's because we have created them, and we are limited beings that cannot know the entirety of the world just by looking at it, or reasoning about it.

YeGoblynQueenne 1202 days ago

I very disagree but have an upvote for a well-argued comment.

>> The question is, does an LLM in the course of modelling asymmetric correlations, develop something analogous to an explanatory model. I think so, in the sense that a good statistical model will intrinsically capture explanatory relations.

A statistical model may "capture" explanatory relations, but can it use them? A data scientist showing a plot to a bunch of people is explaining something using a statistical model, so obviously the statistical model has some explanatory power. But it's the data scientist that is using the model as an explanation. I think the discussion is whether a statistical model can exist that doesn't just "capture" an explanation, but can also make use of that explanation like a human would, for example as background knowledge to build new explanations. That seems very far fetched: a statistical model that doesn't just model, but also introspects and has agency.

Anyway I find it very hard to think of language models as explanatory models. They're predictive models, they are black boxes, they model language, but what do they explain? And to whom? The big debate is that (allegedly) "we don't understand language models" in the first place. We have a giant corpus of incomprehensible data; we train a giant black box model on it; now we have a giant incomprehensible model of the data. What did we explain?

>> But this is in the solution space of good models of statistical regularity of an external system. To maximally predict the next token in a sequence just requires a model of the process that generates that sequence.

Let's call that model M* for clarity. The search space of models, let's call it S. There are any number of models in S that can generate many of the same sequences as M* without being M*. The question is, and has always been, in machine learning, how do we find M* in S, without being distracted by M_1, M_2, M_3, ..., ... that are not M*.

Given that we have a very limited way to test the capabilities of models, and that models are getting bigger and bigger (in machine learning anyway) which makes it harder and harder to get a good idea of what, exactly, they are modelling, how can we say which model we got a hold of?

hackinthebochs 1202 days ago

>A statistical model may "capture" explanatory relations, but can it use them?

That's the beauty of autoregressive training, the model is rewarded for capturing and utilizing explanatory relations because they have an outsized effect on prediction. It's the difference between frequency counting while taking the past context as an opaque unit vs decomposing the past context and leveraging relevant tokens for generation while ignoring irrelevant ones. Self-attention does this by searching over all pairs of tokens in the context window for relevant associations. Induction heads[1] are a fully worked out example of this and help explain in-context learning in LLMs.

>Anyway I find it very hard to think of language models as explanatory models. They're predictive models, they are black boxes, they model language, but what do they explain? And to whom?

The model encodes explanatory relationships of phenomena in the world and it uses these relationships to successfully generalize its generation out-of-distribution. Basically, these models genuinely understand some things about the world. LLMs exhibit linguistic competence as it engages with subject matter to accurately respond to unseen variations in prompts of that subject matter. At least in some cases. I argue this point in some detail here[2].

>how can we say which model we got a hold of?

More sophisticated tests, ideally that can isolate exactly what was in the training data in comparison to what was generated. I think the example of the wide variety of poetry these models generate should strongly raise one's credence that they capture a sufficiently accurate model of poetry. I go into detail on this example in the link I mentioned. Aside from that, various ways of testing in-context learning can do a lot of work here[3].

[1] https://transformer-circuits.pub/2022/in-context-learning-an...

[2] https://www.reddit.com/r/naturalism/comments/1236vzf/

[3] https://twitter.com/leopoldasch/status/1638848881558704129

YeGoblynQueenne 1201 days ago

>> That's the beauty of autoregressive training, the model is rewarded for capturing and utilizing explanatory relations because they have an outsized effect on prediction.

That sentence should be decorated with the word "allegedly", or perhaps "conjecture"! In practical terms, I believe you are pointing out that language models of the GPT family are trained on a context surrounding, not just preceding, a predicted token. That's right (and it gets fudged in discussions about predicting the next token in a sequence), but we could already do that with skip-gram models, and with context-sensitive grammars, and dependency grammars, many years ago, and I don't remember anyone saying those were specially capable of capturing explanatory relations [1]. Although for grammars the claim could be made, since they are generally based on explanatory models of human language (but not because of context-sensitivity).

Anyway, I thought you were arguing that explanations are arbitrary, "explanatory posits", and wouldn't that mean that an explanation doesn't improve prediction? This is not to catch you in contradiction, I'm genuinely unsure about this myself. My understanding is that explanatory hypotheses improve predictions in the long run [2], but that's not to say that a predictive model will improve given explanations, rather explanatory models eventually replace strictly predictive models.

Are you saying that including explanations in training data can improve prediction? That would make sense, but this is very hard to do when training a predictive model on text. In that case, the explanations are at best hidden variables and language models are just not the right kind of model to model hidden variables.

Sorry, writing too much today. And I got work to do. So I won't bitch about "in-context learning" (what we used to call sampling from a model back in the day, three years ago before the GPT-3 paper :).

______________

[1] My Master's thesis was a bunch of language models trained on Howard Philips Lovecraft's complete works, and separately on a corpus of Magic: the Gathering cards. One of those models was a probabilistic Context-Free Grammar, and despite its context-freedom, and because it was a Definite Clause Grammar, I could sample from it with input strings like "The X in the darkness with the Y in the Z of the S" and it would dutifully fill-in the blanks with tokens that maximised the probability of the sentence. So even my puny PCFG could represent bi-directional context, after a fashion. Yet I wouldn't ever accuse it of being explanatory. Although I would say it was quite mad, given the corpus.

[2] I mention in another comment my favourite example of the theory of epicylces compared to Kepler's laws of planetary motion.

hackinthebochs 1201 days ago

>Anyway, I thought you were arguing that explanations are arbitrary, "explanatory posits", and wouldn't that mean that an explanation doesn't improve prediction?

I don't mean to say that explanations are arbitrary, rather that causes are not observed only inferred. But we infer causes because of the explanatory work they do. This isn't arbitrary, it is strongly constrained by predictive value as well as, I'm not sure what to call it, epistemic coherence and intelligibility maybe? Explanatory models are satisfying because they allow us to derive many phenomena from fewer assumptions. Good explanatory models are mutually reinforcing and have a high level of coherence among assumptions ("epistemic coherence"). They also require the fewest number of assumptions taken as brute without further justification ("intelligibility").

Why think explanatory models are better at prediction? Because the mutual coherence among assumptions and explanatory power of the whole (ability to predict much from few assumptions) suggests the explanatory model is getting at the productive features of the phenomena that result in the observed behavior. Essentially, the fewer number of posits, the fewer ways to "bake in" the data into the model. If we were to cast this as a computational problem, i.e. find a program that reproduces the data, shorter programs are necessarily more explanatory. There's no other way to explain the coincidence of program picked out of a small space generating data picked out of a very large space without there being an explanatory relation between the two. Further, our credence for explanation increases as the ratio of the respective spaces diverge.

This is really the problem of machine learning in a nutshell. Is the data vs parameter count over some threshold such that training is biased towards explanatory relations? Is the model biased in the right way to discover these relations faster than it can memorize the data? LLMs seem to have crossed this threshold because of the massive amount of data they are trained on, seemingly much larger than can comfortably be memorized, and the inductive biases of Transformers that search the space of models to extract explanatory relations.

>Are you saying that including explanations in training data can improve prediction? That would make sense, but this is very hard to do when training a predictive model on text. In that case, the explanations are at best hidden variables and language models are just not the right kind of model to model hidden variables.

I agree with this, and I think these explanatory relations are implicit in human text. I gave the example in another comment that I say things like "I picked my cup off the floor" rather than "I picked my cup off the ceiling" because causal relations in the real world influence the text we write. The relation of "things fall down" is widely explanatory. But it seems to me that LLMs are very much general modelers of hidden variables, given the wide applicability of LLMs in areas that aren't strictly related to natural language. But then again, any structured data is a language in a broad sense. And the grammar can be arbitrarily complex and so can encode deep relationships among data in any domain. Personally, I'm not so surprised that a "language model" has such wide applicability.

fenomas 1202 days ago

> Can a machine with access only to statistical patterns in the distribution of text tokens infer the physical structure of reality? We can say, as certain as anything: No.

Um. How do you square that claim with the well-known Othello paper?

https://thegradient.pub/othello/

mjburgess 1202 days ago

The board state can be phrased as moves. This paper profoundly misunderstands the problem.

The issue isn't that associative statistical models of domain Z aren't also models of domain Y where Y = f(Z) -- this is obvious.

Rather there are two problems, (1) the modal properties of these models arent right; and (2) they don't work where the target domain isn't just a rephrasing of the training domain.