

by sasjaws 114 days ago

A while ago I did the nanoGPT tutorial. I went through some of the math with pen and paper and noticed that the loss function for 'predict the next token' and 'predict the next 2 tokens' (or n tokens) is identical.

That was a bit of a shock to me, so I wanted to share this thought. Basically, I think it's not unreasonable to say LLMs are trained to predict the next book instead of a single token.

Hope this is useful to someone.

4 comments

317070 114 days ago

As an expert in the field: this is exactly right.

LLMs are trained to do whole-book prediction: at training time we throw in whole books at a time. It's only when sampling that we do one or a few tokens at a time.

justinator 114 days ago

where do you get these books?

honking intensifies

WHERE DO YOU GET THESE BOOKS?!

tasuki 114 days ago

The local library.

benterix 114 days ago

We do things, but it doesn't feel right

fc417fc802 114 days ago

Can anyone even say what a book really is at the end of the day? It's such an abstract concept. /s

TuringTest 114 days ago

Isn't that the same as compressing the whole book, in a special differential format that compares how the text looks from any given point before and after?

317070 114 days ago

There are many ways to model how the model works in simpler terms. Next-word prediction is useful to characterize how you do inference with the model. Maximizing mutual information, compressing, gradient descent, ... are all useful characterisations of the training process.

But as stated above, next token prediction is a misleading frame for the training process. While the sampling is indeed happening 1 token at a time, due to the training process, much more is going on in the latent space where the model has its internal stream of information.

margalabargala 114 days ago

Everything is the same as everything else. It's all just hydrogen and time mixed together.

apexalpha 114 days ago

Are you referring to this one?: https://github.com/karpathy/build-nanogpt

sasjaws 112 days ago

That's the one, lots of fun and a great entry point for experimentation.

croon 114 days ago

Isn't that why noise was introduced (seed rolling/temperature/high p/low p/etc)? I mean it is still deterministic given the same parameters.

But this might be misleadingly interpreted as an LLM having "thought out an answer" before generating tokens, which is an incorrect conclusion.

Not suggesting you did.

throw310822 114 days ago

> this might be misleadingly interpreted as an LLM having "thought out an answer"

I'm convinced that that is exactly what happens. Anthropic confirms it:

"Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so."

https://www.anthropic.com/research/tracing-thoughts-language...

sasjaws 112 days ago

This is about reasoning tokens, right? I didn't mean that; nanoGPT doesn't do that. nanoGPT inference just outputs letters directly, no intermediate tokens.

throw310822 111 days ago

No, this is about normal tokens. While a SOTA LLM outputs a token at a time, it already has a high level plan of what it is going to say many tokens ahead. This is in reply to the GP who thinks that an LLM can somehow produce coherent and thoughtful sentences while never seeing more than one token ahead.

sasjaws 112 days ago

That's actually an interesting way to look at it. However, I just posted that because I often see articles expressing amazement at how training an LLM on next-token prediction can take it so far, seemingly contrasting the simplicity of the training task with the complexity of the outcome. The insight is that the training task was in fact 'predict the next book', just as much as it is 'predict the next token'. So every time I see that 'predict the next token' representation of the training task, it rubs me the wrong way. It's not wrong, but misleading.

I didn't mean to suggest that is how it 'thinks ahead', but I believe you can see it like that in a way, because it has been trained to 'predict all the following tokens'. So it learned to guess the end of a phrase just as much as the beginning. I consider the mechanism of feeding each output token back in to be an implementation detail that distracts from what it actually learned to do.

I hope this makes sense. FYI, I'm no expert in any way, just dabbling.

sputknick 114 days ago

I'd like to explore this idea. Did you make a blog post about it? Is it simple enough to post in the reply?

sasjaws 112 days ago

No blog post. My LLM expert friend told me this was kinda obvious when I shared it with him, so I didn't think it was worth it.

I can tell you how I got there: I did nanoGPT, then tried to be smart and train a model with a loss function that targets the 2 next tokens instead of one. Calculate the loss function and you'll see it's exactly the same during training.

Sibling commenter also mentions:

> the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b) and then with cross-entropy loss which optimizes for log likelihood this becomes a summation.

Hope that helps.

WithinReason 114 days ago

Look up attention masks

krackers 112 days ago

Unless I've misunderstood the math myself, I don't think GP's comment is quite right if taken literally, since "predict the next 2 tokens" would literally mean predicting indices t+1 and t+2 off of the same hidden state at index t, which is the much newer field of multi-token prediction and not classic LLM autoregressive training.

Instead, what GP likely means is the observation that the joint probability of a token sequence can be broken down autoregressively: P(a,b,c) = P(a) * P(b|a) * P(c|a,b), and then with cross-entropy loss, which optimizes for log likelihood, this becomes a summation. So training with teacher forcing to minimize "next token" loss simultaneously across every prefix of the ground truth is equivalent to maximizing the joint probability of that entire ground-truth sequence.
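A minimal sketch of that equivalence in plain Python. The conditional table `COND` and the helper names are made up for illustration (a real model would compute these conditionals with a network), so treat it as a toy under those assumptions rather than anything taken from nanoGPT:

```python
import math

# Hypothetical toy "model": P(next token | prefix) over the vocabulary {a, b}.
# The numbers are arbitrary; each row sums to 1.
COND = {
    (): {"a": 0.7, "b": 0.3},
    ("a",): {"a": 0.2, "b": 0.8},
    ("b",): {"a": 0.5, "b": 0.5},
    ("a", "a"): {"a": 0.6, "b": 0.4},
    ("a", "b"): {"a": 0.9, "b": 0.1},
    ("b", "a"): {"a": 0.3, "b": 0.7},
    ("b", "b"): {"a": 0.5, "b": 0.5},
}


def next_token_losses(seq):
    """Per-position cross-entropy under teacher forcing: -log P(seq[t] | seq[:t])."""
    return [-math.log(COND[tuple(seq[:t])][seq[t]]) for t in range(len(seq))]


def joint_probability(seq):
    """Chain rule: P(x_1..x_n) = product over t of P(x_t | x_<t)."""
    p = 1.0
    for t in range(len(seq)):
        p *= COND[tuple(seq[:t])][seq[t]]
    return p


seq = ["a", "b", "a"]
summed_loss = sum(next_token_losses(seq))         # "next token" loss over every prefix
sequence_nll = -math.log(joint_probability(seq))  # "whole sequence" loss
print(summed_loss, sequence_nll)
```

The two printed numbers agree up to floating-point rounding, which is the whole point: summing the next-token losses over all prefixes is the negative log-likelihood of the full sequence.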

Practically, even though inference is done one token at a time, you don't do training "one position ahead" at a time. You can optimize the loss function for the entire sequence of predictions at once. This is due to the autoregressive nature of the attention computation: if you start with a chunk of text, as it passes through the layers you don't just end up with the prediction for the next word in the last token's final layer, but _all_ of the final-layer residuals for previous tokens will encode predictions for their following index.

So attention on a block of text doesn't give you just the "next token prediction" but the simultaneous predictions for each prefix which makes training quite nice. You can just dump in a bunch of text and it's like you trained for the "next token" objective on all its prefixes. (This is convenient for training, but wasted work for inference which is what leads to KV caching).

Many people also know by now that attention is "quadratic" in nature (the hidden state of token i attends to the states of tokens 1...i-1), but they don't fully grasp the implication: even though forward inference only predicts the "next token", in the backward pass the error for token i can backpropagate to tokens 1...i-1. This holds despite the causal masking: token 1 doesn't attend to token i, but the hidden state of token 1 is involved in the computation of the residual stream for token i.
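To make both halves of that concrete, here is a toy single-head causal attention in plain Python. It is a sketch under heavy simplifying assumptions: queries, keys, and values are just the raw token vectors (no learned projections, no MLP, one layer), and the function name and inputs are hypothetical. It only illustrates which positions can influence which outputs:

```python
import math


def causal_attention(x):
    """Single-head self-attention with a causal mask and no learned weights.

    x is a list of token vectors; queries, keys and values are all x itself
    (an illustrative simplification). Output i mixes only x[0..i].
    """
    out = []
    for i, q in enumerate(x):
        # Causal mask: position i only scores positions 0..i.
        scores = [sum(a * b for a, b in zip(q, x[j])) for j in range(i + 1)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append(
            [sum(weights[j] * x[j][d] for j in range(i + 1)) for d in range(len(q))]
        )
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base = causal_attention(tokens)

# Change the LAST token: outputs at earlier positions are untouched,
# so one forward pass yields a valid prediction state for every prefix.
changed_last = causal_attention(tokens[:2] + [[-2.0, 3.0]])
print(changed_last[:2] == base[:2])

# Nudge the FIRST token: the output at the last position moves,
# so a loss at the last position has a nonzero gradient w.r.t. the first token.
nudged_first = causal_attention([[1.1, 0.0]] + tokens[1:])
print(nudged_first[2], base[2])
```

Information flows forward only (earlier outputs ignore later tokens), which is exactly why the error signal can flow backward from a late position into every earlier token.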

When it comes to the statement

> it's not unreasonable to say LLMs are trained to predict the next book instead of a single token.

You have to be careful, since during training there is no actual sampling happening. We've optimized to maximize the joint probability of the ground-truth sequence, but this is not the same as maximizing the probability that the ground truth is generated during sampling. Consider that there could be many sampling strategies: greedy, beam search, etc. While the most likely next token is the "greedy" argmax of the logits, the most likely next N tokens are not always found by greedily sampling N times. It's thought that this is one reason why RL is so helpful, since rollouts do in fact involve sampling, so you provide rewards at the "sampled sequence" level, which mirrors how you do inference.
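A small counterexample shows the greedy point. The two-step distribution below is invented purely so that the greedy path loses; the names `FIRST`, `SECOND`, and the helpers are hypothetical:

```python
# Hypothetical two-step distribution, chosen so that greedy decoding is suboptimal.
FIRST = {"a": 0.6, "b": 0.4}                # P(first token)
SECOND = {
    "a": {"x": 0.55, "y": 0.45},            # P(second token | first = a)
    "b": {"x": 0.9, "y": 0.1},              # P(second token | first = b)
}


def joint(first, second):
    """Joint probability of the two-token sequence via the chain rule."""
    return FIRST[first] * SECOND[first][second]


def greedy_decode():
    """Pick the argmax token at each step, conditioning on previous picks."""
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return first, second


def most_likely_sequence():
    """Exhaustively search every two-token sequence for the highest joint probability."""
    candidates = [(f, s) for f in FIRST for s in SECOND[f]]
    return max(candidates, key=lambda fs: joint(*fs))


greedy = greedy_decode()
best = most_likely_sequence()
print(greedy, joint(*greedy))
print(best, joint(*best))
```

Greedy commits to "a" because it is the likelier first token, but the sequence starting with "b" ends up with the higher joint probability (roughly 0.36 versus 0.33).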

It would be right to say that they're trained to ensure the most likely next book is assigned the highest joint probability (not just the most likely next token is assigned highest probability).

sasjaws 112 days ago

The idea I tried to express was purely the loss function thing you mentioned, and how both tasks (1 vs 2 vs n) lead to identical training runs. At least with nanoGPT. I don't know if that extrapolates well to current LLM internals and current training.