Hacker News new | ask | show | jobs
by quag 872 days ago
Is the author claiming that LLMs are Markov Chain text generators? That is, the probability distribution of the next token generated is the same as the probability of those token sequences in the training data?

If so, does it suggest we could “just” build a Markov Chain using the original training data and get similar performance to the LLM?

3 comments

LLMs are Marokov Chains in the following sense: States are vectors of context-length many tokens. Then the model describes a transitions matrix: For a given context-length sized vector of tokens it gives you probabilities for the next context-length sized vector of tokens.
Could you elaborate what context length means in this context? Maybe an example?
The length of the input in tokens. For the simple case of tokens just being characters, a LLM does nothing but take a string of length n, the context length, and calculate for each character in the alphabet the probability that this character is the next character following the input. Then it picks one character at random according to that distribution, outputs it as the first character of the response, appends it to the input, discards the first character of the input to get it back to length n and then repeats the entire process to produce the next character of the response.
No, because the LLM isn't just copying from the same text. Rather, it's "classifying" the text using the self-attention, and then applying a simple Markov Chain (supposedly). The classification is the hard part because how do you know what text from the training data is "similar" to the prompt text.

From the blog post for example:

Original string: 'And only l'

Similar strings: 'hat only l' 's sickly l' ' as\nthey l' 'r kingly l'

From the post:

> I implemented imperative code that does what I’m proposing the transformer is doing. It produces outputs very similar to the transformer.

This means there is probably a way to bypass transformers and get the same results. Would be interesting if it's more efficient. Like given foundation model train something else and run it on much smaller device.

I explained that it's not bypassing transformers and not more efficient in another comment: https://news.ycombinator.com/item?id=39254966