by edmara
778 days ago
The modelling is advanced enough that you can't fundamentally distinguish it from (lossy, limited) planning in the way you're describing. If the KQV didn't encode information about likely future token sequences, then a transformer empirically couldn't outperform Markov text generators.
|
Though, more simply, you can just take any LLM and rephrase it as a Markov model. All algorithms which model conditional probability are equivalent; you can even unpack a NN as a kNN model or a decision tree.
They all model 'planning' in the same way: P(C|A, B) is a 'plan' for C following A, B. There is no model of P("A B C" | "A B"). Literally, at inference time, no computation whatsoever is performed to anticipate any future prediction. This follows trivially from the mathematical formalism (which no one seems to want to understand), and you can also see it empirically: inference time is constant regardless of prompt/continuation.
The reason 'the cat sat...' is completed by 'on the mat' is that each of P(on | the cat sat...), P(the | the cat sat on...), and P(mat | the cat sat on the...) is maximal.
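To make the step-by-step picture concrete, here is a minimal sketch of a count-based conditional model with greedy decoding. The toy corpus, the function names (`fit`, `next_token_distribution`, `greedy_complete`), and the choice to condition on the full prefix are all assumptions made for illustration; this is not how any particular LLM is implemented.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, chosen only so the frequencies favour "mat".
corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the dog sat on the rug",
]

def fit(sentences):
    """Count how often each token follows each full prefix in the corpus."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(1, len(tokens)):
            counts[tuple(tokens[:i])][tokens[i]] += 1
    return counts

def next_token_distribution(counts, prefix):
    """Estimate P(next token | prefix) from the raw counts."""
    seen = counts.get(tuple(prefix), Counter())
    total = sum(seen.values())
    return {token: n / total for token, n in seen.items()}

def greedy_complete(counts, prompt, steps):
    """Append the single most probable next token, one step at a time."""
    tokens = prompt.split()
    for _ in range(steps):
        dist = next_token_distribution(counts, tokens)
        if not dist:  # prefix never seen in the corpus: nothing to predict
            break
        tokens.append(max(dist, key=dist.get))
    return " ".join(tokens)

counts = fit(corpus)
completion = greedy_complete(counts, "the cat sat", steps=3)
last_step = next_token_distribution(counts, "the cat sat on the".split())
print(completion)
print(last_step)
```

Each step only ever asks for the distribution over the next token given the tokens so far; the three-word completion comes from repeating that step three times.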
Why it's maximal is not in the model at all, nor in the data. It's in the data-generating process, i.e., us. It is we who arranged text by these frequencies, and we did so because the phrase is a popular one for academic demonstrations (and so on).
As ever, people attribute "to the data" or, worse, "to the LLM" properties it does not have. Rather, it replays the data to us, and we suppose the LLM must have the property that originally generated this data. Nope.
Why did the tape recorder say, "the cat sat on the mat"? What, on the tape or in the recorder, made "mat" the right word? Surely, the tape must have planned the word...