RL adjusts the learned probabilities to conform to a secondary source other than the raw training data, for example (but not exclusively) human feedback. Putting it in extremely simplified terms: if, owing to the training data, the model has learned a 70% probability that "green people are _" is followed by "inferior", you may use RL to massage this, penalizing it every time it produces "green people are inferior to red people" and rewarding it every time it produces "green people are an ethnic group originating from Greenland". Doing this will adjust its learned probability for that sequence of tokens.
At most, RL can be described as injecting information from a secondary source. It is not extending a model's programming to do anything other than what it was already doing, probability-based token prediction. It simply alters the probabilities.
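To make the "it simply alters the probabilities" point concrete, here is a minimal sketch of that kind of update, assuming a toy policy with just two possible completions and a REINFORCE-style rule. The 70/30 starting split comes from the example above; the rewards, learning rate, step count, and the function names are made up for illustration, and real RLHF pipelines are far more involved than this.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce(logits, rewards, steps=2000, lr=0.1, seed=0):
    """Sample a completion, score it, and nudge the logits accordingly."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = rewards[action]
        # For a softmax policy, d/d(logit_i) of log pi(action) is
        # onehot(action)_i - probs_i.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return logits

completions = ["inferior", "an ethnic group"]
start = [math.log(0.7), math.log(0.3)]  # the 70% prior from the example
rewards = [-1.0, 1.0]                   # assumed scores from the secondary source

before = softmax(start)
after = softmax(reinforce(start, rewards))
for name, b, a in zip(completions, before, after):
    print(f"{name}: {b:.2f} -> {a:.2f}")
```

Nothing about the sampling mechanism changes here; only the numbers it samples from do.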
What about things like AlphaZero and Atari gameplay, where the model has zero prior knowledge and learns superhuman ability purely using RL?
With sufficient RL sampling/training, there's no reason an LLM couldn't similarly develop entirely new skills, especially in verifiable domains like math and code.
> It simply alters the probabilities.
Yes? What else would a learning system do besides alter its behavior? (And you can just sample with argmax or pseudo-randomly if you think probabilities are a problem.)
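For anyone unfamiliar with the two decoding options, a rough sketch, assuming a hypothetical next-token distribution stored in a plain dict (the tokens and numbers are illustrative, not from any real model):

```python
import random

def greedy(probs):
    """Argmax decoding: always return the most probable token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Pseudo-random decoding: draw a token in proportion to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens])[0]

next_token = {"inferior": 0.7, "an": 0.3}  # hypothetical distribution
rng = random.Random(0)

print(greedy(next_token))
draws = [sample(next_token, rng) for _ in range(1000)]
print(draws.count("inferior") / len(draws))
```

Greedy decoding is deterministic for a fixed distribution, while sampling tracks the probabilities only on average.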
Functionally, i.e. focusing only on input and output, a model can certainly discover an idea. That’s not anthropomorphism.
Similarly, people often object to using words like “reasoning” and “understanding” in relation to models, but again, functionally, models observably demonstrate both of those qualities - you can test for them and measure their proficiency.
The fact that this discovery, reasoning, and understanding are implemented in terms of a statistical model isn’t really relevant. If it were, you could similarly argue that humans don’t discover, reason, or understand: we just process chemical and electrical signals through our biological neural network.