Hacker News
by 317070 111 days ago
As an expert in the field: this is exactly right.

LLMs are trained to do whole-book prediction: at training time we throw in whole books at a time. It's only when sampling that we do one or a few tokens at a time.
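A rough way to see the distinction being drawn here, as a sketch only: the toy bigram counter below stands in for an LLM (real models are neural networks, and every name here, such as `fit_bigram`, `sequence_loss` and `sample`, is made up for illustration). The training-style objective scores every position of the whole text in one go, while sampling walks forward one token at a time.

```python
import math
import random
from collections import Counter, defaultdict

def fit_bigram(tokens):
    """Toy stand-in for a language model: add-one-smoothed bigram counts."""
    vocab = sorted(set(tokens))
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

    def prob(prev, nxt):
        # P(nxt | prev) with add-one smoothing over the vocabulary.
        total = sum(counts[prev].values()) + len(vocab)
        return (counts[prev][nxt] + 1) / total

    return vocab, prob

def sequence_loss(tokens, prob):
    """Training-style objective: mean negative log-likelihood
    over every position of the whole text at once."""
    pairs = list(zip(tokens, tokens[1:]))
    return -sum(math.log(prob(p, n)) for p, n in pairs) / len(pairs)

def sample(start, length, vocab, prob, seed=0):
    """Sampling: extend the sequence one token at a time."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        weights = [prob(out[-1], tok) for tok in vocab]
        out.append(rng.choices(vocab, weights=weights, k=1)[0])
    return out

# A (very short) hypothetical "book".
book = "the cat sat on the mat and the cat ran".split()
vocab, prob = fit_bigram(book)

loss = sequence_loss(book, prob)           # one number for the whole text
generated = sample("the", 5, vocab, prob)  # built token by token
print(f"whole-text loss: {loss:.3f}")
print("sampled:", " ".join(generated))
```

The loss is a single number summarizing all positions of the text, which is the sense in which training sees "the whole book"; the sampling loop is the only place where generation is strictly sequential.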

2 comments

where do you get these books?

honking intensifies

WHERE DO YOU GET THESE BOOKS?!

The local library.
We do things, but it doesn't feel right
Can anyone even say what a book really is at the end of the day? It's such an abstract concept. /s
Isn't that the same as compressing the whole book, in a special differential format that compares how the text looks from any given point before and after?
There are many ways to describe how the model works in simpler terms. Next-word prediction is a useful way to characterize how you do inference with the model. Maximizing mutual information, compression, gradient descent and so on are all useful characterizations of the training process.

But as stated above, next-token prediction is a misleading frame for the training process. While sampling does happen one token at a time, the training process means that much more is going on in the latent space, where the model has its internal stream of information.
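The compression view raised in this exchange can be made concrete with a small sketch. It assumes an ideal entropy coder, which spends about -log2(p) bits on a symbol the model assigned probability p, and it uses a deliberately simple adaptive frequency model rather than anything resembling an LLM. The function name `ideal_code_length_bits` and the sample text are made up for illustration.

```python
import math
from collections import Counter

def ideal_code_length_bits(text, alphabet):
    """Bits an ideal entropy coder would need when each symbol is coded
    with an adaptive, add-one-smoothed frequency model that has only
    seen the symbols before it."""
    counts = Counter()
    seen = 0
    bits = 0.0
    for ch in text:
        p = (counts[ch] + 1) / (seen + len(alphabet))
        bits += -math.log2(p)   # cost of this symbol under the model
        counts[ch] += 1         # the model updates only after coding the symbol
        seen += 1
    return bits

# Hypothetical, highly predictable "book".
text = "aaaaaaaaaaaaaaab"
alphabet = sorted(set(text))

adaptive_bits = ideal_code_length_bits(text, alphabet)
uniform_bits = len(text) * math.log2(len(alphabet))  # no model: fixed-width code
print(f"adaptive model: {adaptive_bits:.2f} bits, uniform code: {uniform_bits:.2f} bits")
```

The total is the same summed log-loss that training minimizes, just measured in bits: a model that predicts the text better compresses it into fewer bits.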

Everything is the same as everything else. It's all just hydrogen and time mixed together.