Hacker News
by hackinthebochs 5 days ago
>At all times the LLM is, indeed, predicting the next token

The point is that saying they're just "predicting the next token" is not at all explanatory nor providing insight. Saying the brain is just firing action potentials gives you no understanding about how the brain does what it does or what the space of its capabilities are. Similarly, predicting the next token tells you nothing about the capabilities of LLMs.

3 comments

True, but that is a great fact to start from, and understand.

Then the next question becomes "HOW do they predict the next token? There are many ways that can be done; why is this particular algorithm so GOOD?"

When people say "We don't understand how LLMs work", aren't they really saying we don't understand how this specific algorithm used to predict the next token works? No, they are not, because "we" do understand how all those algorithms work; there are many descriptions of them available.

So the question then really is "Why is the prediction this algorithm makes so good compared to some other statistical algorithms?"

It's not about "Why does AI work so well?". It should be "Why does this particular XYZ algorithm work so well?"

I think it's a perfectly fine one-liner explanation. If a kid asks why grass is green, do you stop explaining when you say chlorophyll is green, or do you go on to explain electron hybridization and all the spectra stuff, or do you go further to explain the structure of our eyes and why we perceive that reflected light as green? Also why green? Why not red? Do you have to explain that? It all depends on the audience, the context, and how much space you have to explain as well as how much you know. For you and more experienced people of course this is not sufficient, so you need to know more than "predict tokens", and that opens up follow-up questions like "how does it do that".

The point is that the output is text that is statistically correlated with the input.

The capability of the LLM is not to reason, it's to generate text that matches the patterns seen in the training corpus. It's possible that all you need to "reason" is plausible text generation. I'm not saying it's not. But nothing the LLM does fails to be explained by plausible-text-generation.

I contend that the best way to understand an LLM's capabilities is to understand the nature of the probability distribution that produced it. For instance, why does an "angry" prompt tend to produce more help than a "polite" one? Trying to explain that in terms of emotions or reasoning doesn't make sense, but it's readily possible to explain through the connections between text in the training corpus...

>The point is that the output is text that is statistically correlated with the input.

But we can simply note that this description applies to any machine learning algorithm. Yet LLMs are lightyears better than, say, Markov chains. What people are after is something that elucidates the features of LLMs that allow them to be so productive over what came before.

There is absolutely nothing stopping someone from distilling a modern LLM into a very effective Markov chain. The physical size of the model would explode, because a context window of C tokens drawn from a vocabulary of size B would need B^C Markov prior states, but the actual output would be a deterministic version of the LLM's, equivalent to top-n sampling with n=1.
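
To put rough numbers on the B^C blow-up, here is a minimal sketch. The vocabulary size and context lengths are made-up illustrative values, not taken from any particular model, and the function names are mine.

```python
import math

def markov_state_count(vocab_size: int, context_len: int) -> int:
    """Number of distinct contexts (table keys) of length context_len."""
    return vocab_size ** context_len

def decimal_digits(vocab_size: int, context_len: int) -> int:
    """Decimal digits in vocab_size ** context_len, without building the integer."""
    return math.floor(context_len * math.log10(vocab_size)) + 1

# Illustrative assumption: B = 50,000 tokens in the vocabulary.
B = 50_000
print(markov_state_count(B, 2))   # 2500000000 states for a 2-token context
print(decimal_digits(B, 8))       # 38 digits for an 8-token context
print(decimal_digits(B, 2048))    # digit count for a 2048-token context
```

Even a context of a few tokens already needs more table entries than could ever be stored, which is the "explode" part of the claim.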

In other words, a Markov chain and a Transformer model are exactly equivalent in power (there is NOTHING that can be done with one and not the other). The Transformer model is just better pretrained and a more efficient compression/generation.

>In other words, a Markov chain and a Transformer model are exactly equivalent in power

Nonsense. Markov chains treat the past context as a single unit, an N-tuple with no internal structure. LLMs leverage the internal structure of the context which allows a large class of generalization that Markov chains necessarily miss.

No, not nonsense.

Both are a lookup table whose key is the entire context window and whose value is a probability distribution for what the next token should be.

You can say the choice of probability distribution in the value is "leveraging the internal structure of the context" or not, but the same tokens in two different orders are two different lookup keys and saying it's impossible to achieve some result with a Markov chain is factually incorrect.
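
For concreteness, a toy version of the lookup-table view might look like the sketch below. The corpus, the order of 2, and the name `build_table` are all assumptions for illustration; it only shows the "context tuple in, distribution out" shape, not anything about how an LLM computes its distribution.

```python
from collections import Counter, defaultdict

def build_table(tokens, order):
    """Map each context tuple of length `order` to next-token probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        counts[context][tokens[i + order]] += 1
    # Normalize raw counts into a probability distribution per context.
    return {
        ctx: {tok: n / sum(c.values()) for tok, n in c.items()}
        for ctx, c in counts.items()
    }

corpus = "the cat sat on the mat the cat ran".split()
table = build_table(corpus, order=2)

print(table[("the", "cat")])      # {'sat': 0.5, 'ran': 0.5}
print(("cat", "the") in table)    # False: same tokens, different order, different key
```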

https://arxiv.org/pdf/2410.02724 describes the equivalence formally.

That paper doesn't prove the equivalence of Transformers and Markov chains, it uses Markov chains as a theoretical model to understand the behavior of Transformers. The expressivity of the model matters, and Transformers just are more expressive than Markov chains.

>but the same tokens in two different orders are two different lookup keys

This is necessarily true for Markov chains and not necessarily true for Transformers. Transformers learn invariance over certain kinds of semantically irrelevant transformations. The Markov chain simply has to learn each input variant independently, resulting in an explosion of state space and data requirements compared to the functionally equivalent Transformer. Expressive power matters.
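
A toy way to see the data-requirement point is sketched below, under loud assumptions: the "invariant" model is just a table keyed on the sorted context, with the invariance hand-coded. It is not a Transformer, which would have to learn which transformations are irrelevant rather than being told. The training example and function names are hypothetical.

```python
from collections import Counter, defaultdict

def exact_key(context):
    """Markov-style key: the context tuple as-is, so order matters."""
    return tuple(context)

def sorted_key(context):
    """Toy hand-coded invariance: token order is ignored."""
    return tuple(sorted(context))

def train(examples, key_fn):
    """Count next tokens per key, where key_fn decides which contexts are shared."""
    counts = defaultdict(Counter)
    for context, next_token in examples:
        counts[key_fn(context)][next_token] += 1
    return counts

def predict(counts, key_fn, context):
    """Most frequent next token for this context, or None if the key was never seen."""
    seen = counts.get(key_fn(context))
    return seen.most_common(1)[0][0] if seen else None

# One training example; the query below is a reordering never seen in training.
examples = [(("red", "big", "ball"), "bounces")]
query = ("big", "red", "ball")

print(predict(train(examples, exact_key), exact_key, query))    # None
print(predict(train(examples, sorted_key), sorted_key, query))  # bounces
```

The exact-key table needs a separate training example for every ordering, while the model that shares statistics across orderings gets the unseen variant for free.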

I really don't get people's love for saying X is "just" Y (it's just a Markov chain, it's just a Kernel method). It's a strange pathology to focus on the superficial similarity while downplaying the boost in expressive power from where the models diverge.