by golol
1188 days ago
It's wrong. A decoder-only transformer performs a (possibly random) operation on a state from the state space {tokens}^CtxWindow, where the distribution of the new state depends entirely on the previous state. It is a Markov chain with a special structure: the new state is deterministically equal to the old state shifted by one, with only the last token being newly generated.
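To make that concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the three-token vocabulary, the window size of 3, and `next_token_distribution`, which is just a toy stand-in for the model's forward pass (attention included). The only property that matters is that it reads nothing but the current window.

```python
import random

VOCAB = ["a", "b", "c"]   # toy vocabulary (assumption)
CTX_WINDOW = 3            # toy context window size (assumption)


def next_token_distribution(state):
    """Hypothetical stand-in for the transformer forward pass.

    It may mix information from every position in the window,
    but its only input is the current window.
    """
    weights = [1 + state.count(tok) for tok in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]


def step(state, rng):
    """One Markov transition on the state space VOCAB^CTX_WINDOW."""
    probs = next_token_distribution(state)
    new_token = rng.choices(VOCAB, weights=probs, k=1)[0]
    # Deterministic shift by one; only the last token is newly sampled.
    return state[1:] + (new_token,)


rng = random.Random(0)
state = ("a", "b", "c")
trajectory = [state]
for _ in range(5):
    state = step(state, rng)
    trajectory.append(state)

for s in trajectory:
    print(s)
```

The transition never looks at anything older than the current window, which is all the Markov property asks for. How the function computes its output from that window makes no difference to the definition.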
A tennis ball in flight is a Markov chain since the state at t is a function of the state at t-1.
You have missed the point about the Attention Mechanism in GPT. That is not a Markov chain by definition.