by hackinthebochs
7 days ago
>In other words, a Markov chain and a Transformer model are exactly equivalent in power

Nonsense. Markov chains treat the past context as a single unit, an N-tuple with no internal structure. LLMs leverage the internal structure of the context, which allows a large class of generalization that Markov chains necessarily miss.

Both amount to a lookup table whose key is the entire context window and whose value is a probability distribution over the next token.
You can say the choice of probability distribution in the value is "leveraging the internal structure of the context" or not, but the same tokens in two different orders are two different lookup keys, so saying it's impossible to achieve some result with a Markov chain is factually incorrect.
https://arxiv.org/pdf/2410.02724 describes the equivalence formally.
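
To make the lookup-table framing concrete, here is a minimal sketch of an order-n Markov chain as a dict keyed by the whole context tuple. The names `build_table` and `corpus` are made up for illustration, the toy corpus is an assumption, and this only shows the Markov side of the comparison (it says nothing about how a Transformer arrives at its distribution).

```python
from collections import Counter, defaultdict

def build_table(tokens, n):
    """Hypothetical helper: order-n Markov chain as a lookup table.

    Key   = tuple of the previous n tokens (the whole "context window").
    Value = empirical probability distribution over the next token.
    """
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n):
        key = tuple(tokens[i:i + n])       # order of tokens is part of the key
        counts[key][tokens[i + n]] += 1

    table = {}
    for key, nxt in counts.items():
        total = sum(nxt.values())
        table[key] = {tok: c / total for tok, c in nxt.items()}
    return table

# Toy corpus (an assumption, purely for illustration).
corpus = "a b c b a d".split()
table = build_table(corpus, 2)

# Same tokens, different order: two distinct keys with distinct distributions.
print(table[("a", "b")])
print(table[("b", "a")])
```

Because the key is an ordered tuple, `("a", "b")` and `("b", "a")` are separate entries, so nothing stops the table from assigning them different next-token distributions.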