| > As the parent says, modern LLMs are finetuned with a different loss function after pretraining. This means that in some strict sense they're no longer autoregressive models – but they do still generate text one word at a time. I think this really is the heart of the "just predicting the next word" critique. That more-or-less sums up the nuance. I just think the nuance is crucially important, because it greatly improves intuition about how the models function. In your example (which is a fantastic example, by the way), consider the case where the LLM sees: <user>What do you call someone who studies the stars?</user><assistant>An astronaut What is the next prediction? Unfortunately, for a variety of reasons, one high probability next token is: \nAn Which naturally leads to the LLM writing: "An astronaut\nAn astronaut\nAn astronaut\n" forever. It's somewhat intuitive as to why this occurs, even with SFT, because at a very base level the LLM learned that repetition is the most successful prediction. And when its _only_ goal is the next token, that repetition behavior remains prominent. There's nothing that can fix that, including SFT (short of a model with many, many, many orders of magnitude more parameters). But with RL the model's goal is completely different. The model gets thrown into a game, where it gets points based on the full response it writes. The losses it sees during this game are all directly and dominantly related to the reward, not the next token prediction. So why don't RL models have a probability for predicting "\nAn"? Because that would result in a bad reward by the end. The models are now driven by a long term reward when they make their predictions, not by fulfilling some short-term autoregressive loss. All this to say, I think it's better to view these models as they predominately are: language robots playing a game to achieve the highest scoring response. The HOW (autoregressiveness) is really unimportant to most high level discussions of LLM behavior. |
Similarly, instead of waiting for whole output, loss can be decomposed over output so that partial emits have instant loss feedback.
RL, on the other hand, is allowing for more data. Instead of training on the happy path, you can deviate and measure loss for unseen examples.
But even then, you can avoid RL, put the model into a wrong position and make it learn how to recover from that position. It might be something that’s done with <thinking>, where you can provide wrong thinking as part of the output and correct answer as the other part, avoiding RL.
These are all old pre NN tricks that allow you to get a bit more data and improve the ML model.