by Borealid
10 days ago
No, not nonsense. Both are lookup tables whose key is the entire context window and whose value is a probability distribution over the next token. You can say the choice of probability distribution in the value is "leveraging the internal structure of the context" or not, but the same tokens in two different orders are two different lookup keys, and saying it's impossible to achieve some result with a Markov chain is factually incorrect. https://arxiv.org/pdf/2410.02724 describes the equivalence formally.
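To make the "lookup table" framing concrete, here is a minimal sketch of a fixed-window Markov chain built from counts. The function name `build_table`, the toy corpus, and the window size `k=2` are all made up for illustration; nothing here comes from the linked paper. The key is a tuple of tokens, so the same tokens in a different order form a different key.

```python
from collections import Counter, defaultdict

def build_table(tokens, k):
    """Map each length-k context (an ordered tuple) to a next-token distribution."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - k):
        context = tuple(tokens[i:i + k])   # the "entire context window" is the key
        counts[context][tokens[i + k]] += 1

    table = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        # Normalise raw counts into a probability distribution over next tokens.
        table[context] = {tok: n / total for tok, n in counter.items()}
    return table

# Toy corpus, purely illustrative.
corpus = "the cat sat on the mat the cat ran".split()
table = build_table(corpus, k=2)

print(table[("the", "cat")])       # distribution observed after "the cat"
print(("cat", "the") in table)     # same tokens, different order: a separate key
```

In this toy corpus the reordered key `("cat", "the")` was never observed, so the table has no entry for it at all. That is the sense in which two orderings are two independent lookup keys.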
|
>but the same tokens in two different orders are two different lookup keys
This is necessarily true for Markov chains and not necessarily true for Transformers. Transformers learn invariance over certain kinds of semantically irrelevant transformations. The Markov chain simply has to learn each input variant independently, resulting in an explosion of state space and data requirements compared to the functionally equivalent transformer. Expressive power matters.
I really don't get people's love for saying X is "just" Y (it's just a Markov chain, it's just a kernel method). It's a strange pathology to focus on the superficial similarity while downplaying the boost in expressive power where the models diverge.
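A rough sketch of the state-space point above, under loud assumptions: the hand-written `SYNONYMS` map and `canonical` function are hypothetical stand-ins for an invariance that a transformer would have to learn from data, and this is not how a transformer computes anything. It only counts how many separate keys an exact-match table needs compared with a model that treats the variants as equivalent.

```python
from itertools import product

def table_size(vocab_size, context_length):
    """Number of distinct keys an exact-match table over full contexts could need."""
    return vocab_size ** context_length

# Hypothetical "semantically irrelevant" transformation: swapping synonyms.
SYNONYMS = {"big": "large", "huge": "large", "small": "little", "tiny": "little"}

def canonical(context):
    """Collapse synonym variants to one representative (stand-in for learned invariance)."""
    return tuple(SYNONYMS.get(tok, tok) for tok in context)

# Every way of phrasing the same four-token context using the synonyms above.
slots = [("big", "large", "huge"), ("dog",), ("small", "little", "tiny"), ("cat",)]
variants = list(product(*slots))

exact_keys = set(variants)                       # what a lookup table must store
shared_keys = {canonical(v) for v in variants}   # what an invariant model needs

print(len(exact_keys), len(shared_keys))
print(table_size(vocab_size=50_000, context_length=8))
```

The exact-match table needs one entry, with its own training data, for each variant, while the invariant version shares a single entry. With more interchangeable slots the gap grows multiplicatively, which is the explosion in state space and data requirements the comment describes.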