|
>> That's the beauty of autoregressive training, the model is rewarded for capturing and utilizing explanatory relations because they have an outsized effect on prediction.
That sentence should be decorated with the word "allegedly", or perhaps "conjecture"! In practical terms, I believe you are pointing out that language models of the GPT family are trained on a context surrounding, not just preceding, a predicted token. That's right (and it gets fudged in discussions about predicting the next token in a sequence), but we could already do that with skip-gram models, and with context-sensitive grammars, and dependency grammars, many years ago, and I don't remember anyone saying those were specially capable of capturing explanatory relations [1]. Although for grammars the claim could be made, since they are generally based on explanatory models of human language (but not because of context-sensitivity).
Anyway, I thought you were arguing that explanations are arbitrary, "explanatory posits", and wouldn't that mean that an explanation doesn't improve prediction? This is not to catch you in contradiction, I'm genuinely unsure about this myself. My understanding is that explanatory hypotheses improve predictions in the long run [2], but that's not to say that a predictive model will improve given explanations, rather explanatory models eventually replace strictly predictive models. Are you saying that including explanations in training data can improve prediction? That would make sense, but this is very hard to do when training a predictive model on text. In that case, the explanations are at best hidden variables and language models are just not the right kind of model to model hidden variables.
Sorry, writing too much today. And I got work to do. So I won't bitch about "in-context learning" (what we used to call sampling from a model back in the day, three years ago before the GPT-3 paper :).
______________
[1] My Master's thesis was a bunch of language models trained on Howard Phillips Lovecraft's complete works, and separately on a corpus of Magic: the Gathering cards. One of those models was a probabilistic Context-Free Grammar, and despite its context-freedom, and because it was a Definite Clause Grammar, I could sample from it with input strings like "The X in the darkness with the Y in the Z of the S" and it would dutifully fill in the blanks with tokens that maximised the probability of the sentence. So even my puny PCFG could represent bi-directional context, after a fashion. Yet I wouldn't ever accuse it of being explanatory. Although I would say it was quite mad, given the corpus.
[2] I mention in another comment my favourite example of the theory of epicycles compared to Kepler's laws of planetary motion. |
I don't mean to say that explanations are arbitrary, rather that causes are not observed, only inferred. But we infer causes because of the explanatory work they do. This isn't arbitrary; it is strongly constrained by predictive value as well as, I'm not sure what to call it, epistemic coherence and intelligibility maybe? Explanatory models are satisfying because they allow us to derive many phenomena from fewer assumptions. Good explanatory models are mutually reinforcing and have a high level of coherence among assumptions ("epistemic coherence"). They also require the fewest assumptions taken as brute without further justification ("intelligibility").
Why think explanatory models are better at prediction? Because the mutual coherence among assumptions and the explanatory power of the whole (the ability to predict much from few assumptions) suggest that the explanatory model is getting at the productive features of the phenomena that result in the observed behavior. Essentially, the fewer the posits, the fewer the ways to "bake in" the data into the model. If we were to cast this as a computational problem, i.e. find a program that reproduces the data, shorter programs are necessarily more explanatory. There's no other way to explain the coincidence of a program picked out of a small space generating data picked out of a very large space without there being an explanatory relation between the two. Further, our credence in the explanation increases as the gap between the sizes of the two spaces grows.
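To make the counting argument a bit more concrete, here is a rough sketch. It assumes programs are plain bit strings and that each program outputs at most one dataset, so the number of datasets that short programs can reproduce is capped by the number of short programs. The function name and the bit budgets are mine, chosen only for illustration.

```python
from fractions import Fraction


def max_fraction_explained(program_bits: int, data_bits: int) -> Fraction:
    """Upper bound on the share of data_bits-long bit strings that can be
    produced by some program of at most program_bits bits.

    Assumes programs are plain bit strings and each program outputs at most
    one string, so distinct outputs <= number of programs.
    """
    num_programs = 2 ** (program_bits + 1) - 1  # bit strings of length 0..program_bits
    num_datasets = 2 ** data_bits               # bit strings of length exactly data_bits
    return min(Fraction(1), Fraction(num_programs, num_datasets))


if __name__ == "__main__":
    # Hold the program budget fixed and let the data space grow.
    for data_bits in (16, 32, 64):
        bound = max_fraction_explained(program_bits=10, data_bits=data_bits)
        print(f"data_bits={data_bits}: at most {float(bound):.3e} of datasets")
```

With the program budget held fixed, the bound shrinks as the data space grows, which is the sense in which a short program matching a large dataset is unlikely to be a coincidence. It says nothing about whether a particular learner will actually find that short program.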
This is really the problem of machine learning in a nutshell. Is the ratio of data to parameter count over some threshold such that training is biased towards explanatory relations? Is the model biased in the right way to discover these relations faster than it can memorize the data? LLMs seem to have crossed this threshold because of the massive amount of data they are trained on, seemingly much more than can comfortably be memorized, and because of the inductive biases of Transformers, which search the space of models to extract explanatory relations.
>Are you saying that including explanations in training data can improve prediction? That would make sense, but this is very hard to do when training a predictive model on text. In that case, the explanations are at best hidden variables and language models are just not the right kind of model to model hidden variables.
I agree with this, and I think these explanatory relations are implicit in human text. I gave the example in another comment that I say things like "I picked my cup off the floor" rather than "I picked my cup off the ceiling" because causal relations in the real world influence the text we write. The relation of "things fall down" is widely explanatory. But it seems to me that LLMs are very much general modelers of hidden variables, given the wide applicability of LLMs in areas that aren't strictly related to natural language. But then again, any structured data is a language in a broad sense. And the grammar can be arbitrarily complex and so can encode deep relationships among data in any domain. Personally, I'm not so surprised that a "language model" has such wide applicability.