In learning to predict the next token, the model has to pick up lots of little bits of world knowledge. I'm sure someone would disagree with the word "understand", but it certainly operates with more complexity than, say, a Markov chain. It has seen lots of Python, and in order to predict better, it has developed internal models of how Python works. Think of how much better you'd do predicting the next character of Python code compared to random noise: there's a lot of structure there.
In my (limited) experience it seems to perform even better for typed languages (for example Kotlin/Java/Swift) than for Python. The Python code it provided often had subtle type issues when working with dates, while the Kotlin date-related code was more accurate and correct in terms of types. That makes sense, since the additional type information likely leads to a much better "internal model of how Kotlin works".
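The comment doesn't say which date bugs showed up, so the following is only a guess at the kind of subtle type issue that can slip through in Python: mixing `date` with `datetime`, or naive with timezone-aware values, fails only at runtime. The helper name `try_compare` is made up for illustration.

```python
from datetime import date, datetime, timezone


def try_compare(a, b):
    """Return the result of a < b, or the name of the exception raised."""
    try:
        return a < b
    except TypeError as exc:
        return type(exc).__name__


today = date(2023, 1, 15)
now_naive = datetime(2023, 1, 15, 12, 0)
now_aware = datetime(2023, 1, 15, 12, 0, tzinfo=timezone.utc)

# date vs datetime: the ordering comparison raises TypeError at runtime.
date_vs_datetime = try_compare(today, now_naive)

# naive vs aware datetime: also a TypeError, again only at runtime.
naive_vs_aware = try_compare(now_naive, now_aware)

# Converting to a common type first makes the comparison well-defined.
same_type = try_compare(today, now_naive.date())
```

A statically typed language would typically reject the equivalent mismatches at compile time, which fits the observation above.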
What surprised me was the level of "understanding" it seems to show when I give it some of my own sample code. It can analyze the code, explain how it works and what it does, use libraries, suggest improvements, and apply those improvements.
The end result isn't perfect, but it's still highly impressive. I was an AI skeptic before, but I now see the possible benefits of AI assistants for programming.
Some other prompts with very impressive results:
* "Write an implementation for the following Kotlin repository interface: <insert-interface-with-full-type-signatures>."
* (followup) "Add save/load methods that store the backing map in a JSON file"
* (followup) "Replace Gson with Jackson for JSON serialization"
* "Write an Android layout XML for a login form with username/password/login button"
* (followup) "Provide the Kotlin activity code for this layout"
* "Write a Kotlin function that parses a semver input string into a data class"
I think another possible explanation for the better results with typed languages is that they might have used an execution environment to check whether the code the model came up with actually compiles, and used that as additional input during training. Some sort of execution environment also seems to me a possible explanation for how they got the model to emulate a terminal so well.
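Nothing in this thread confirms that such a step was used in training. Purely as a sketch of what "check whether the code compiles" could look like for Python, the built-in `compile()` can parse a candidate snippet without running it. The function names `compiles` and `filter_candidates` are hypothetical.

```python
def compiles(source: str) -> bool:
    """Return True if `source` parses as valid Python, without executing it."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        # SyntaxError: invalid code; ValueError: e.g. null bytes on older Pythons.
        return False


def filter_candidates(candidates):
    """Keep only the generated snippets that pass the syntax check."""
    return [c for c in candidates if compiles(c)]
```

This only catches syntax errors. A compiler for a statically typed language would also catch type errors, so the signal would be richer there.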
It’s not ‘more complexity’ than a Markov chain - it essentially is a Markov chain, just looking at a really deep sequence of preceding tokens to decide the probabilities for what comes next.
And it’s not just looking that up in a state machine, it’s ‘calculating’ it based on weights.
But in terms of ‘take sequence of input tokens; use them to decide probable next token’, it’s functionally indistinguishable from a Markov chain.
I look at deep sequences of tokens and predict what comes next. Can you milk me? Once you've broadened "basically a Markov chain" to "any function from a sequence of tokens to a probability distribution over tokens", a lot of explanatory power is lost. If you had to characterize the difference between a brute-force mapping based on pure frequencies and a model that selectively calculates probabilities based on underlying structure, wouldn't you say the latter has more complexity?
You don't have to believe the hype, but if you think you can get GPT performance out of anything remotely resembling a Markov chain, I encourage you to try.
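For reference, here is a minimal sketch of the "brute force mapping based on pure frequencies" end of that comparison: a Markov chain that just counts which token followed each context in the training data. The function names and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_markov(tokens, order=1):
    """Count how often each token follows each `order`-token context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        counts[context][tokens[i + order]] += 1
    return counts


def next_token_distribution(counts, context):
    """Turn the raw counts for one context into probabilities."""
    followers = counts.get(tuple(context))
    if not followers:
        return {}  # unseen context: a pure lookup table has nothing to say
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}


tokens = "the cat sat on the mat and the cat ran".split()
model = train_markov(tokens, order=1)
dist = next_token_distribution(model, ["the"])  # "cat" twice, "mat" once
```

A table like this cannot generalize to a context it has never seen, which is the practical gap the comment is pointing at.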
There's nothing about Markov chains that says the model has to be based on brute calculation from previously observed frequencies. The point is that the exact behavior of these LLMs could also be modeled as a Markov chain with a sufficiently massive state machine.
Obviously that's impractical and not how LLMs actually work: they derive the transition probabilities for a state from the input rather than having them pre-baked. But as for the claim that 'these are more sophisticated than a Markov chain', strictly speaking they aren't. They are in fact a lossy compression of a Markov model.
But the attention mechanism seems fundamentally un-Markov-like, in that at a given position it can pool information from all other positions. In the simplest case, a model trained on masked language modeling, the prediction for the mask in "Capital of [MASK] is Paris" can depend bidirectionally on all the surrounding context. I guess it's true that when the mask is at the end (next-token completion), you could consider this a Markov model with each state being the full attention window (2048 tokens, I think?). But that's like saying all real-world computers are FSMs: it's technically true, but it isn't the best model for actually understanding the behavior.
Since most inputs are shorter than the max token length, you never actually end up using the Markov-ness. Calling it a Markov model then amounts to saying it's a function that gives a probability distribution for the next token given the previous tokens, which just pushes the question back onto how that function is defined.
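To put a rough number on the "each state is the full window" view, here is a back-of-the-envelope calculation. The vocabulary size of 50257 is an assumption (the GPT-2/GPT-3 BPE vocabulary), and 2048 is the window size guessed at above.

```python
import math

VOCAB_SIZE = 50257      # assumption: GPT-2/GPT-3 BPE vocabulary size
CONTEXT_LENGTH = 2048   # the window size guessed at in the comment above

# The number of distinct full-window states is VOCAB_SIZE ** CONTEXT_LENGTH.
# Work in log10 to get the order of magnitude without building the integer.
log10_states = CONTEXT_LENGTH * math.log10(VOCAB_SIZE)
```

Under those assumptions the state count is on the order of 10^9628, so an explicit transition table is out of the question even though the Markov description is technically valid.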
Could you not use two Markov chains for masked language modeling? One works forward from the beginning until [MASK] and one works backward from the end until [MASK], and [MASK] is set to the average of both chains. If a direct average cannot be found, it is assumed to be a multi-word expression, and words are generated from the two chains until they match.
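A minimal sketch of that two-chain idea, assuming order-1 (bigram) chains and a plain average of the two distributions. The multi-word fallback is left out, and the function names and toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Map each token to a Counter of the tokens that directly follow it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def normalize(counter):
    """Convert counts to probabilities (empty dict if there are no counts)."""
    total = sum(counter.values())
    return {tok: n / total for tok, n in counter.items()} if total else {}


def fill_mask(corpus, left, right):
    """Pick the token for the gap between `left` and `right`."""
    # Forward chain: what follows `left` when reading left to right.
    forward = normalize(bigram_counts(corpus).get(left, Counter()))
    # Backward chain: what follows `right` when reading right to left.
    backward = normalize(bigram_counts(corpus[::-1]).get(right, Counter()))
    candidates = set(forward) | set(backward)
    scores = {c: (forward.get(c, 0.0) + backward.get(c, 0.0)) / 2
              for c in candidates}
    return max(scores, key=scores.get) if scores else None


corpus = ("capital of france is paris . capital of italy is rome . "
          "capital of france is paris").split()
guess = fill_mask(corpus, "of", "is")
```

With order-1 chains each side sees only its immediate neighbor, so in "capital of [MASK] is paris" the word "paris" has no influence on the guess. Longer contexts would help, at the cost of the sparsity problems frequency tables usually run into.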
It's really awesome how good it is at modeling certain world knowledge. It seems to struggle with putting everything into one framework, though. For example, it still makes a lot of mathematics and logic errors.
Make a large enough model and train it on all sorts of data, and it will be able to encode generalized concepts that can then be applied to specific tasks (given only a few examples of the task, or even just a query or question rather than an example).