|
|
|
|
|
by chpatrick
184 days ago
|
|
It's not n sometimes, k tokens some other times. LLMs have fixed context windows, you just sometimes have less text so it's not full. They're pure functions from a fixed size block of text to a probability distribution of the next character, same as the classic lookup table n gram Markov chain model. |
|
2. “Fixed-size block” is a padding detail, not a modeling assumption. Yes, implementations batch/pad to a maximum length. But the model is fundamentally conditioned on a variable-length prefix (up to the cap), and it treats position 37 differently from position 3,700 because the computation explicitly uses positional information. That means the conditional distribution is not a simple stationary “transition table” the way the n-gram picture suggests.
3. “Same as a lookup table” is exactly the part that breaks. A classic n-gram Markov model is literally a table (or smoothed table) from discrete contexts to next-token probabilities. A transformer is a learned function that computes a representation of the entire prefix and uses that to produce a distribution. Two contexts that were never seen verbatim in training can still yield sensible outputs because the model generalizes via shared parameters; that is categorically unlike n-gram lookup behavior.
I don't know how many times I have to spell this out for you. Calling LLMs markov chains is less than useless. They don't resemble them in any way unless you understand neither.