by reissbaker 611 days ago

This is not how LLMs work. An LLM doesn't "scour through its vast training data" at response generation time to "look for patterns — specifically, the most probably sentence/sentences that fit the question," nor does "probability" in terms of LLMs refer to "frequency — how many times something appears in its training data."

LLMs don't even have access to their training data at response generation time. And responses aren't created by "voting" on how many times the model has seen a particular phrase. (It doesn't operate on sentences, either: it operates on word fragments, aka tokens.)
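To make "word fragments" concrete, here's a rough sketch of splitting text into tokens. Big caveat: the vocabulary below is made up for illustration, and greedy longest-match is a simplification. Real LLM tokenizers learn their vocabularies from data (typically with algorithms such as byte-pair encoding) and have tens of thousands of entries.

```python
# Hypothetical toy vocabulary, invented for this example only.
VOCAB = ["un", "believ", "able", "token", "s",
         "a", "b", "e", "i", "k", "l", "n", "o", "t", "u", "v"]

def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split of text into fragments from vocab."""
    tokens = []
    i = 0
    while i < len(text):
        # Pick the longest vocabulary entry that matches at position i.
        match = max((v for v in vocab if text.startswith(v, i)),
                    key=len, default=None)
        if match is None:
            match = text[i]  # unknown character: emit it as its own token
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("tokens"))        # ['token', 's']
```

The point is just that the model's unit of work is the fragment, not the word or the sentence.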

I'd recommend Andrej Karpathy's "Neural Networks: Zero to Hero" YouTube lecture series (https://karpathy.ai/zero-to-hero.html), but here's a pretty condensed overview: the way LLMs work is they serially generate tokens, one at a time, when you ask them a question (generating tokens is referred to as "inference").

During training, we start with a random set of values for the model parameters (the "weights") and ask the model to predict the probability of each next token in sequences from the training data. At the beginning of the process, it's usually very wrong, unless you won some unbelievably lucky universal dice roll. A loss function scores how wrong each prediction was, backpropagation computes how much each parameter contributed to that error (the gradient), and gradient descent in that high-dimensional space nudges the parameters in the direction that reduces the loss, with the step size controlled by the optimizer.

Doing this zillions of times is how we end up with the final values for the parameters, aka the model weights. It's not by "counting votes" or even "counting sequences"; it's by calculating which parameter values are the best ones to predict the token probabilities. The frequency of tokens appearing in the dataset (or even the frequency of sentences) isn't directly computed, although extremely-high-frequency sequences might be memorized. But the model can't memorize all of them, because the models are much smaller than the training set.
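Here's a minimal sketch of that loop in plain Python, under heavy simplifying assumptions: the "model" is a single table of logits (one row per previous token) rather than a transformer, the gradient is written out by hand instead of coming from backpropagation through many layers, and the vocabulary, data, step count, and learning rate are all made up. A model this tiny can trivially memorize its toy data, so it only illustrates the mechanics of "random weights, measure loss, nudge weights, repeat," followed by serial token generation.

```python
import math
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "down", "."]   # hypothetical toy vocabulary
V = len(VOCAB)
DATA = [0, 1, 2, 3, 4] * 20                  # "the cat sat down ." repeated, as token ids
PAIRS = list(zip(DATA, DATA[1:]))            # (previous token, next token) examples

# Start from random weights: W[prev][nxt] is the logit for nxt following prev.
W = [[random.gauss(0.0, 1.0) for _ in range(V)] for _ in range(V)]

def softmax(logits):
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss(weights):
    """Mean cross-entropy of the next-token predictions over PAIRS."""
    return -sum(math.log(softmax(weights[p])[n]) for p, n in PAIRS) / len(PAIRS)

def train(weights, steps=500, lr=1.0):
    """Plain gradient descent on the cross-entropy loss."""
    for _ in range(steps):
        grad = [[0.0] * V for _ in range(V)]
        for p, n in PAIRS:
            probs = softmax(weights[p])
            for j in range(V):
                # d(loss)/d(logit j) = predicted prob minus 1 for the true next token.
                grad[p][j] += (probs[j] - (1.0 if j == n else 0.0)) / len(PAIRS)
        for i in range(V):
            for j in range(V):
                weights[i][j] -= lr * grad[i][j]   # step against the gradient

def generate(weights, start, length):
    """Inference: sample one token at a time, each conditioned on the previous one."""
    out = [start]
    for _ in range(length):
        probs = softmax(weights[out[-1]])
        out.append(random.choices(range(V), weights=probs)[0])
    return [VOCAB[t] for t in out]

loss_before = average_loss(W)
train(W)
loss_after = average_loss(W)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
print(generate(W, start=0, length=4))
```

Note that nothing in `train` counts occurrences or looks anything up at generation time: the training data only influences the weights through the gradient, and `generate` only ever sees the weights.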

The theory for why this works so well is that since the models don't have enough space to memorize the entirety of the massive training set (e.g. Llama 3.1 8B is about 16GB, but was trained on a large share of the public internet, which is orders of magnitude larger than that), the best values for the parameters are actually ones that build some sort of underlying world model explaining why a given token sequence is probable. That's very different from "counting votes," or counting sentence frequencies. Even if you disagree with this hypothesis for why the models work, you have to admit there's some underlying meaning being parsed out: it's not just memorization, even if the model does sometimes memorize useful facts (useful, at least, for predicting probabilities at training time). It simply can't memorize its training set: the training data is way too large compared to the size of the model, and it doesn't get to cheat and look at the training data when asked a question.
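A back-of-the-envelope version of that size comparison, with the assumptions spelled out: 16-bit weights (2 bytes per parameter), a training set of roughly 15 trillion tokens (the ballpark Meta has publicly cited for Llama 3; treat it as an assumption here), and about 4 bytes of text per token (a rough rule of thumb for English, also an assumption).

```python
PARAMS = 8e9            # Llama 3.1 8B parameter count (approximate)
BYTES_PER_PARAM = 2     # assumption: 16-bit weights
TRAIN_TOKENS = 15e12    # assumption: roughly the publicly cited token count
BYTES_PER_TOKEN = 4     # assumption: rough average for English text

model_bytes = PARAMS * BYTES_PER_PARAM
data_bytes = TRAIN_TOKENS * BYTES_PER_TOKEN
ratio = data_bytes / model_bytes

print(f"model ~{model_bytes / 1e9:.0f} GB, "
      f"data ~{data_bytes / 1e12:.0f} TB, "
      f"ratio ~{ratio:.0f}x")
# model ~16 GB, data ~60 TB, ratio ~3750x
```

Even if those assumptions are off by a fair margin, the weights are thousands of times smaller than the text they were trained on, so wholesale memorization isn't an option.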