|
|
|
|
|
by hackinthebochs
16 days ago
|
|
That paper doesn't prove the equivalence of Transformers and Markov chains, it uses Markov chains as a theoretical model to understand the behavior of Transforms. The expressivity of the model matters, and Transformers just are more expressive than Markov chains. >but the same tokens in two different orders are two different lookup keys This is necessarily true for Markov chains and not necessarily true for Transformers. Transformers learn invariance over certain kinds of semantically irrelevant transformations. The Markov chain simply has to learn each input variant independently, resulting in an explosion of state space and data requirements compared to the functionally equivalent transformer. Expressive power matters. I really don't get people's love for saying X is "just" Y (it's just a Markov chain, it's just a Kernel method). It's a strange pathology to focus on the superficial similarity while downplaying the boost in expressive power from where the models diverge. |
|
Do you have some concrete example of a transformer that cannot be represented as a mapping from inputs to probability distribution of outputs?
I say they're equivalent because it is possible to losslessly convert one to the other by wasting massive amounts of disk space and time.
As a second example proving the point, imagine you sampled a transformer's output for a certain context 85 trillion times, and put the output token frequencies in a table. Repeat for all possible inputs (of which there are a finite number). Then you built literally a hash map looking up the context and spitting out the distribution. That certainly is NOT a transformer any more (it's a hash map!!!), but the output approaches indistinguishability as the sample count increases - if the transformer is reasoning, so is the hash map built from it.
I'm not talking hot air here, they really are provably equivalent because a 1:1, onto mapping exists.
For the record, "X is more expressive than Y" means "there exists at least one thing that Y cannot represent and X can". Nothing to do with size or time.