| I think what this argument is missing is the emergent properties of the LLM. In order to “predict the next word”, the LLM doesn’t just learn the most likely word from a corpus for the preceding string. If that were true, it would not generalise outside of its training set. The LLM learns about the structure of the language, the context, and in the process of doing so constructs a model of the world as represented by words. Admittedly the model is still limited, but it seems to me that there is something more insightful to be gleaned here: that given enough data and sufficient pressure to learn, excelling at scale on a relatively simple task leads indirectly to a form of intelligence. For me the biggest takeaway of LLMs might be that “intelligence is pretty cheap, actually” and that the human brain is not so remarkable as we’d like to believe. |
So technically the LLM is not doing P(next word | previous word), but rather P(associated_words(next word) | associated_words(previous), associated_words(previous_-1), ...).
This means its search space for each conditional step is still extremely large in the historical corpus, and there's more flexibility to reach "across and between contexts", but it isn't sensitive to context; we just arranged the data that way.
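A rough sketch of the difference, just to make the two conditionals concrete. Everything here is made up for illustration: the tiny corpus, the hand-written association table, and the `associated_words` helper are all hypothetical stand-ins, and a real LLM learns its associations in its weights rather than looking them up in a table.

```python
from collections import Counter

# Toy corpus and association table; both are invented for this illustration.
CORPUS = (
    "the cat sat on the mat . the dog sat on the rug . "
    "a kitten slept on the sofa ."
).split()
ASSOCIATED = {"cat": {"cat", "kitten", "dog"}, "kitten": {"kitten", "cat"}}


def associated_words(word):
    """Hypothetical stand-in for whatever 'association' the model has learned."""
    return ASSOCIATED.get(word, {word})


def next_word_counts(previous_words):
    """Count which words follow any word in `previous_words` in the corpus."""
    counts = Counter()
    for prev, nxt in zip(CORPUS, CORPUS[1:]):
        if prev in previous_words:
            counts[nxt] += 1
    return counts


def distribution(counts):
    """Normalise raw counts into probabilities (empty dict if nothing matched)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# P(next word | previous word): only the literal word "kitten" is consulted.
plain = distribution(next_word_counts({"kitten"}))

# P(next word | associated_words(previous)): "cat" contributes evidence too.
expanded = distribution(next_word_counts(associated_words("kitten")))

print(plain)     # {'slept': 1.0}
print(expanded)  # 'sat' and 'slept' each get probability 0.5
```

The second distribution draws on text that never mentioned "kitten" at all, which is the "across and between contexts" reach, while the lookup itself is still just counting over how the data was arranged.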
Soon enough, people with enough money will build diagnostic (XAI) models of LLMs that are powerful enough to show this process at work over their training data.
To visualize roughly, imagine you're in a library and you're asked a question. The first word selects a very large number of pages across many books (and some whole books). The second word selects both other books and further pages within the books you already have. Keep going: each additional word you're asked, you convert to a set of words, find more pages and books, and also get narrower paragraph samples from the ones you already have. Finally, with the total set of pages and paragraphs you have to hand at the end of the question, you find the word most likely to come next.
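The library picture can be written down as a toy program. This is only a sketch of the analogy, not a claim about how any real LLM is implemented: the `PAGES`, the `ASSOCIATED` table, and the scoring rule (count overlaps, then weight candidate next words by page score) are all assumptions chosen to keep it short.

```python
from collections import Counter

# Hypothetical "library": each page is a list of words.
PAGES = [
    "the capital of france is paris".split(),
    "the capital of italy is rome".split(),
    "france is famous for wine".split(),
    "rome is in italy".split(),
]
# Hypothetical association table standing in for learned word associations.
ASSOCIATED = {"france": {"france", "french"}, "capital": {"capital", "city"}}


def associated_words(word):
    """Convert one question word into a set of related words."""
    return ASSOCIATED.get(word, {word})


def answer_next_word(question):
    """Pick the next word by accumulating pages word by word, as in the analogy."""
    page_scores = Counter()
    for word in question:
        related = associated_words(word)
        for page_id, page in enumerate(PAGES):
            # Each question word pulls in new pages and reinforces pages already held.
            page_scores[page_id] += sum(1 for token in page if token in related)

    # With the final pile of pages, see what follows the last question word.
    last = associated_words(question[-1])
    candidates = Counter()
    for page_id, score in page_scores.items():
        if score == 0:
            continue  # this page was never selected
        page = PAGES[page_id]
        for prev, nxt in zip(page, page[1:]):
            if prev in last:
                candidates[nxt] += score  # better-matched pages count for more
    return candidates.most_common(1)[0][0] if candidates else None


print(answer_next_word("the capital of france is".split()))  # paris
print(answer_next_word("the capital of italy is".split()))   # rome
```

Both questions end in "is", so a plain P(next word | previous word) lookup could not tell them apart; it is the accumulated pile of pages from the earlier words that tips the answer.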
This process will eventually be visualised properly, with a real-world LLM, but it'll take a significant investment to build this sort of explanatory model, since you need to reverse from weights to training data across the entire inference process.