| HN Mirror

But it seems like the attention mechanism fundamentally isn't markov-like in that at a given position it can pool information from all other positions. So as in the simplest case when trained on masked-language modeling, the prediction of the mask in "Capital of [MASK] is Paris" can depend bidirectionally on all surrounding context. While I guess it's true that in the case where the mask is at the end (for next-token completion), you could consider this as a markov model with each state being the max attention window (2048 tokens I think?), but that's like saying all real-world computers are FSMs: it's technically true, but this isn't the best model to use for actually understanding its behavior.

Since for most inputs that are smaller than the max token length you never actually end up using the markov-ness, calling it a markov model seems like it's just in a way saying it's a function that provides a probability distribution for the next token given the previous tokens. Which just pushes the question back onto how that function is defined.