by AltruisticGapHN 9 days ago
I don't like how most LLM explainer articles and videos say that an LLM essentially "predicts the next word".

I'm a developer but not very good at maths and I still don't understand any of it.

An LLM clearly has some "visual" capacity. You ask Gemini to build something with Canvas and it's able to reason about the shape of things. Recently I wanted a checkbox with a gradient flowing around its edge. It figured out it could use a radial gradient from the center of the checkbox and overlay that with a small inner div, so you only see the edge and the gradient looks like it is circling around the checkbox.

How is that "predicting the next word"?

Not saying AI is intelligent or conscious or anything like that, but the algorithm clearly is far more complex than "predicting words".

What I mean is that the LLM is able to represent things in space. That part I don't understand.

I also still don't understand the relationship between the chat-based LLM and the multi-modal stuff. I think I read somewhere that when an image is generated, it is also tokens?

12 comments

Predicting a word is the final objective, in that the output of the model is a probability distribution over the next token. However, choosing the right token is more complicated than just regurgitating the training data (you won't encounter an exact example in the training data, so you need to interpolate). This makes the model learn abstract representations of things, which it can manipulate before turning the result back into tokens. RL also complicates this because the "fitness" is now some arbitrary metric computed over an entire sequence of tokens.
Your casual understanding is imprecise.

At all times the LLM is, indeed, predicting the next token. Anything it does emerges from that.

It did not "figure anything out". It predicted that text describing the use of a radial gradient was likely to follow text describing your problem.

>At all times the LLM is, indeed, predicting the next token

The point is that saying they're just "predicting the next token" is neither explanatory nor insightful. Saying the brain is just firing action potentials gives you no understanding of how the brain does what it does or what the space of its capabilities is. Similarly, predicting the next token tells you nothing about the capabilities of LLMs.

True, but that is a great fact to start from, and understand.

Then the next question becomes "HOW do they predict the next token? There are many ways that can be done, so why is this particular algorithm so GOOD?"

When people say "We don't understand how LLMs work", aren't they really saying we don't understand how this specific algorithm used to predict the next token works? No, they are not, because "we" do understand how all those algorithms work; there are many descriptions of them available.

So the question then really is "Why is the prediction this algorithm makes, so good, as compared to some other statistical algorithms?"

It's not about "Why does AI work so well?". It should be "Why does this particular XYZ algorithm work so well?"

I think it's a perfectly fine one-line explanation. If a kid asks why grass is green, do you stop explaining when you say chlorophyll is green, or do you go on to explain electron hybridization and all the spectra stuff, or do you go further to explain the structure of our eyes and why we perceive that reflected light as green? Also why green? Why not red? Do you have to explain that? It all depends on the audience, the context, and how much space you have to explain, as well as how much you know. For you and more experienced people, of course, this is not sufficient, so you need to know more than "predict tokens", and that opens up follow-up questions like "how does it do that".
The point is that the output is text that is statistically correlated with the input.

The capability of the LLM is not to reason, it's to generate text that matches the patterns seen in the training corpus. It's possible that all you need to "reason" is plausible text generation. I'm not saying it's not. But nothing the LLM does fails to be explained by plausible-text-generation.

I contend that the best way to understand an LLM's capabilities is to understand the nature of the probability distribution that produced it. For instance, why does an "angry" prompt tend to produce more help than a "polite" one? Trying to explain that in terms of emotions or reasoning doesn't make sense, but it's readily possible to explain through the connections between text in the training corpus...

>The point is that the output is text that is statistically correlated with the input.

But we can simply note that this description applies to any machine learning algorithm. Yet LLMs are lightyears better than, say, Markov chains. What people are after is something that elucidates the features of LLMs that allow them to be so productive over what came before.

There is absolutely nothing stopping someone from distilling a modern LLM into a very effective Markov chain. The physical size of the model would explode, because a context window of C tokens drawn from a vocabulary of size B would need B^C Markov prior states, but the actual output would be a deterministic version of the LLM's, equivalent to greedy decoding (top-k sampling with k=1).

In other words, a Markov chain and a Transformer model are exactly equivalent in power (there is NOTHING that can be done with one and not the other). The Transformer model is just better pretrained and a more efficient compression/generation.

>In other words, a Markov chain and a Transformer model are exactly equivalent in power

Nonsense. Markov chains treat the past context as a single unit, an N-tuple with no internal structure. LLMs leverage the internal structure of the context which allows a large class of generalization that Markov chains necessarily miss.

Lol, the bird did not 'fly' - it just flapped its wings and generated lift!
No. The how is relevant here because it leads to understanding of the resulting behavior.

If you train the LLM on a corpus that shows people saying the sky is red, you get an LLM that is predisposed to say the sky is red. This is true even if it's also trained on all of the science that explains how and why the sky is blue.

If it were to "figure out" or "reason", it would not have such a predisposition to emit "red" after "the sky is" just because that matches the reward during training.

In other words, the token prediction is important because it both explains the successes AND the failures of the LLM. If there were situations in which a bird could fail to fly, then how it tried to fly would also be crucial knowledge.

You can also teach humans science and math and then they can be trained by a cult to not use any of that reasoning when emitting canned responses that they were rewarded by the cult for internalizing during their training. "Fake News!"

You're caught up on the mechanics of token processing (floating point matrix ALU math) and ignoring the context that p(next token) as a function being "computed" is doing so over a trillion parameters. You can poorly train a model, sure, but assuming you don't indoctrinate it too much, properties like cognition emerge - it learns to reason; why? Reasoning is more efficient and compact than memorizing answers.

I completely agree that humans sometimes are not applying reasoning to things.

I'm not trying to argue a model cannot "reason" or have "cognition", whatever those things are. I'm only saying that it's absolutely the case that whatever those things are, they come from its mechanism of predicting one token at a time ad infinitum, and that throwing away a deep understanding in favor of a shallow one is foolish. Just because it might seem to be "reasoning" does not mean it IS doing so, and certainly giving the appearance of reasoning does not mean it is NOT a token predictor.

If I knew deeply how the human brain works I would use that understanding instead of saying things like "this person reasons" or "this person thinks".

In summary, I'm not "caught up in" anything - I'm just trying to point out that the original poster here is incorrect in saying that clearly LLMs aren't working through token prediction. They are, and all their behavior is 100% explained by token prediction. That's more than enough for interesting behavior!

A dominant theory for human cognition is predictive coding

https://en.wikipedia.org/wiki/Predictive_coding

More like being suspended by a thread...
It's still predicting the next word. Somewhere in the gigantic dataset that the LLM was trained on, the phrase "gradient border" appears in the vicinity of CSS code that renders such a thing. Therefore, when you run it in an inference loop, there's a good chance it outputs that CSS code when you tell it to render a "gradient border".

Multi-modal models that can understand visual input do exist, but no such visual reasoning process happened in the example you mentioned. Not unless you have a visual feedback loop in the coding harness.

I'm not dismissing the capability of "predicting the next word", however. The vast amount of training data enables the extremely complex and useful behavior you just described.

What about things it wasn’t trained on?

For instance, I’ve written a few custom languages to learn how to write a VM and the lexer/parser/compiler/etc. It had never seen them before, simply because I made them and it had never been trained on them. I just gave it the syntax, which is different from anything it had seen.

After I gave it my documentation, it was able to write the language just like a language it had been trained on. I’ve also seen this behavior at work, where there are weird, definitely non-standard quirks to how things are done, and it can handle them.

Because in its training data there is information on how to map from documentation of a language to actual programs. This means that following the pattern it can map between documentation for any language to programs in that language.

But I think it will have difficulty in crossing paradigm boundaries, by simply using documentation.

That’s because it does not encode words or keywords or anything like that. It encodes their relationships. Formal languages like programming languages are pretty compact. There’s not much variation between the C-like languages, just as most Lisps (Clojure, Scheme, Elisp, Racket) are fairly similar to each other.

The exact syntax does not matter, only the grammar. If you give it the grammar, and then the keywords, it can find something that has similar grammar and then use your keywords.

I'd be very careful assuming something is not in an LLM's training set. Those data sets are truly vast. And, from experience, people tend to miss a lot of their content.

As a for instance, back in the day some academics wrote a paper that compared GPT 3.5 to a couple of inductive programming systems (including one of mine) on solving programming problems in a certain well-known esoteric language which I shall call "L". The task was to solve those programming problems one-shot. The authors asserted that the "L" problem sets were unlikely to be in 3.5's training set, but I found them without much search in a public github repo. I mean the entire dataset was right there. In this case the researchers are colleagues and friends and I know they weren't simply negligent or malicious, they just missed the fact that their "unlikely to be in the training set" data was on the web.

So I'd always assume that if an LLM can perform a task that's because it's seen examples of the task during its training.

Without forgetting that LLMs have this really shockingly powerful ability to interpolate between examples and they can improve their performance on say Task A by training on Task B, where A and B are different but similar.

e.g. they seem to get better at translating between language pairs of which they have few examples of parallel text by training on other pairs of languages for which they have more parallel text; they seem to learn something about language translation in general by training on more examples of translation. I haven't got a good reference on that handy but it's well-known (and of course over-hyped and exaggerated by tech CEOs).

So without wanting to diminish your work, I'd guess that your new language's syntax is different and novel but everything else about it is more ordinary and the similarities are such that an LLM can wing it and write you a lexer etc. After all, the whole point about parser generators and similar tools is that the task can be abstracted and separated from syntax in the first place.

In fact LLMs are very good at that sort of thing, filling in the blanks as it were. I'm old enough to remember the excitement about GPT 3.5 being able to form syntactically correct sentences with nonsensical words given to it.

For example, I just asked Chat [1]:

  Hey chat. The gostak distims the doshes. What happens to the doshes?
And it promptly answered:

  The doshes get distimmed.
See, it even got the spelling right!

_________________

[1] https://chatgpt.com/c/6a242b65-e248-83ed-9a6e-f238a1e871b6

I could answer the same query the same way as a child.
I do not think you are correct.
LLMs fundamentally work by predicting the next word (token). But that should not be used to diminish their potential capabilities. It's like saying that human brains "just predict (or produce) the next electrical impulse". Fundamentally correct, but says nothing about the potential emergent capabilities of scaled-up systems that work like that.

Emergent properties of complex systems should not be diminished just because the underlying operating principle is simple.

So much this - so many people seem to miss the forest for the trees: emergent properties are not bound to the complexity of the underlying mechanics.

All of life arises (maybe) from very simple subatomic particles, and at each stage you can repeat this refrain, complexity increasing as you stack.

True, but there are also some who push it too far. As with all things, take it with moderation.
Game of Life comes to mind: the simplest logic, yet the emerging patterns are hard to believe.
I understand that to be the "emergent abilities" people talk about. There are correlations in the dataset strong enough for it to seem to have an understanding that it wasn't obvious it would get from simply "predicting the next word".
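For anyone who hasn't seen how little logic is involved, here is a minimal sketch of one Game of Life generation. It assumes a small fixed 5x5 grid whose outside counts as dead; the grid size and function names are illustrative choices.

```c
#include <assert.h>
#include <string.h>

#define ROWS 5
#define COLS 5

/* Count live neighbours of cell (r, c); cells outside the grid are dead. */
static int live_neighbours(unsigned char g[ROWS][COLS], int r, int c) {
    int count = 0;
    for (int dr = -1; dr <= 1; ++dr) {
        for (int dc = -1; dc <= 1; ++dc) {
            if (dr == 0 && dc == 0) continue;
            int rr = r + dr, cc = c + dc;
            if (rr >= 0 && rr < ROWS && cc >= 0 && cc < COLS)
                count += g[rr][cc];
        }
    }
    return count;
}

/* Advance the grid in place by one generation using Conway's rules:
 * a cell is alive next turn if it has exactly 3 live neighbours,
 * or if it is alive now and has exactly 2. */
void life_step(unsigned char g[ROWS][COLS]) {
    assert(g != NULL);
    unsigned char next[ROWS][COLS];
    for (int r = 0; r < ROWS; ++r) {
        for (int c = 0; c < COLS; ++c) {
            int n = live_neighbours(g, r, c);
            next[r][c] = (n == 3) || (g[r][c] && n == 2);
        }
    }
    memcpy(g, next, sizeof next);
}
```

Starting from three live cells in a horizontal row (a "blinker"), one call to `life_step` turns the row vertical and a second call turns it back. Gliders, guns, and even working computers have been built on top of that same two-line rule.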
> What I mean, is the LLM is able to represent things in space . That part I don't understand.

Why do you think this is mutually exclusive to "LLM predicts the next token"?

If you tell someone from the 19th century that bytes (just 0s and 1s!) can represent an opera, a song, or even a whole interactive experience, they might be really confused. But there is no reason they can't.

If you tell someone without a math background that the sums of smaller and smaller sine waves can represent pretty much anything in our universe, they might be really confused. But there is no reason they can't.

There is simply no reason that a next-token predictor can't generate a nice-looking checkbox.

You're talking about simple compression and encoding mechanisms and by implication you're drawing an analogy to an LLM encoding/compressing the information.

And sure, it does, but the person you're replying to was trying to understand why it also seems to reason about the query to give an answer consistent with it, despite not being trained on that query or answer. Your answer seems to imply that it's just another slick complex encoding.

But the emergent property of trillions of digital neurons predicting the next token is that in the process of being trained to do so, they can also learn to reason.

At some scale, it is efficient to encode cognition which is capable of mimicking the cognition which generated the input tokens.

I don't want to pretend I can explain LLMs, but the same "math" can be applied to visual and non-visual things. The dot product of two vectors, divided by the product of their lengths, gives you the cosine of the angle between them. This is true in 2 or 3 dimensions. But it's also true in 4, 5, 6...n dimensions, even though we cannot visualize a 4D space. That it's an angle is relevant for you in the space you can comprehend, but for math or a machine it works in any number of dimensions. So it doesn't need to understand anything visually if the math checks out.
LLMs are modelled to predict the next token, and are indeed trained to do so on enormous bodies of text. But to be really good at predicting the next token (word) at the end of a long string of text, you must understand what the text means. If I give you the entire text of a long novel and at the end ask you a single "yes/no" question about the plot, you only need to emit a single token, but emitting the correct one implies having understood the plot of the novel. This is what LLMs do. They're generating meaningful, coherent text, which implies understanding and cognition at a level that is much deeper than that of the single token they generate at each forward pass. Internally, the LLM has learned to represent the meaning of the entire prompt text, the concepts it implies and its possible continuations far beyond the horizon of simply outputting the next token.
> This is what LLMs do. They're generating meaningful, coherent text

No, they generate grammatically coherent text. That is because human language grammars are fundamentally mathematical structures that can be approximated with matrix operations.

They don't generate meaningful text because they have no inherent knowledge of the world.

If you've used LLMs for any amount of time you've already noticed how often they get confused about numeric quantities - like confusing notions of "bigger than" and "less than" or being unable to count letters in words.

This is because any meaning in their output is only accidental.

>is the LLM is able to represent things in space

It is imitating the text written by humans who can represent things in space.

Sorry you're being downvoted for asking a very reasonable question. I don't think any of the replies here address your question either.

If I can do my best to answer, Gemini is a multi-modal system. That means it's trained not only on text but also still images, video and also sound. The training happens in parallel and the representation of each modality is usually different, so the image recognition part is not trained on text tokens but pixels, the video part (probably) on video frames etc. There is some kind of integrated training that goes on so that text can be generated that is correlated to an image and so on, but I don't know the specifics about Gemini in particular. This kind of thing is not exactly new either, you can find systems that captioned images before the rise of LLMs simply by training on examples of images coupled to their textual descriptions.

In that sense it's not entirely correct to call Gemini an "LLM" because it's not only a "language" (or, more precisely, text) model. But LLM I guess becomes a bit of a shorthand for everything based on, or combined with, an LLM.

Anyway that's what's going on: it's not just predicting the next word. It's also predicting the next image frame or the next set of pixels etc associated with the next word.

It can’t. It’s like a Redditor, it just repeats what it has seen other people say.

It has read all of stackoverflow, so it has seen your kind of problem before. Try asking it something really unusual and it will shit the bed.

> It’s like a Redditor, it just repeats what it has seen other people say.

Can stochastic parrots understand irony?

I do agree bigly. Calling what is basically a superhuman brain inside a computer just a "token predictor" is peak thinkslop.
Inside the magic AI box is literally nothing but this loop:

    int n_tokens = 0;
    while (n_tokens < TOKENS_MAX) {
        // decode() is assumed to pick the next token given the context so far
        int next_token = decode(context, ++position);
        print(token_to_text(next_token));
        ++n_tokens;
    }
If you don't believe me then just download llama.cpp and see for yourself.
decode() looks simple! Wow, obviously intelligence can't live behind that function call! /s

Now, take that loop, replace the implementation of decode(context, ++position), and hand it to a human who is bored enough to play along, using a notebook to organize their thoughts and translate them to/from this encoding (you might write a helper function to do this for the human in the front-end of the new decode() impl, but the data flow in and out of decode() will remain the same):

    decode(context, position) {
        cached string_answer = ask_human_question_via_context(context);
        return decode_human_answer_to_tokens(cached string_answer, position);
    }

Is the output you get not thinking anymore because it passed through this harness? Did the human's mind somehow get reduced to mere interpolation?

The human mind is still a human mind. Putting a simple harness in front of a mind does not affect its fundamental properties.

In an LLM, decode() is calling into a trillion parameter connectome.

The LLM predicts next token one at a time. (Stochastically.) This is a literal truth.

Deal with it.

It's a literal truth that predicting the next token one at a time does not preclude intelligence on the other side of the decode function. Deal with it.
To be honest, if this is intelligence, then it's really boring. We can't even simulate the brain of a 302-neuron nematode (C. elegans). You're telling me we can't even run nematode.exe but somehow we have already created intelligence?
Yes, for a certain vague notion of "intelligence". Even homes, televisions and watches are "smart" now.