by fenomas
488 days ago
As I understood that part, in RL for LLMs you take questions for which the model already sometimes emits correct answers, and then repeatedly run inference while reinforcing the activations the model made during correct responses, which lets it evolve its own ways of more reliably reaching the right answer. (Hence the analogy to training AlphaGo, wherein you take a model that sometimes wins games, and then play a bunch of games while reinforcing the cases where it won, so that it evolves its own ways of winning more often.)
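To make that loop concrete, here is a toy sketch, not how any real LLM is trained. It assumes a "model" that is just a softmax over four candidate answers to one question, and it uses a plain REINFORCE-style update with a reward of 1 for a correct sample and 0 otherwise. The names `softmax` and `reinforce_correct` and all the numbers are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_correct(logits, correct, steps=500, lr=0.5, seed=0):
    """Sample answers; when a sample is correct, raise its log-probability."""
    rng = random.Random(seed)
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        answer = rng.choices(range(len(logits)), weights=probs)[0]
        reward = 1.0 if answer == correct else 0.0
        # REINFORCE: d log p(answer) / d logit_i = 1[i == answer] - p_i
        for i in range(len(logits)):
            grad = (1.0 if i == answer else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return logits


# Hypothetical starting point: the model is right about 25% of the time.
start = [0.0, 0.0, 0.0, 0.0]
before = softmax(start)[2]
after = softmax(reinforce_correct(start, correct=2))[2]
print(f"P(correct) before: {before:.2f}, after: {after:.2f}")
```

Wrong samples get zero reward here, so nothing changes on them; the only learning signal comes from the times the model already got it right, which is the point of the AlphaGo analogy.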
In the LLM case you need an already capable model before you can do RL. I also feel the problem selection part matters: the problems can't be too hard, or the model never produces a correct answer to reinforce. So there's still a lot of labor involved.
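One way to picture the problem selection step is a pass-rate filter: sample the model several times per problem and keep only the problems it solves sometimes. This is just a sketch under my own assumptions. `estimate_pass_rate`, `select_problems`, the thresholds, and the stand-in `toy_attempt` "model" are all hypothetical, not taken from any particular training pipeline.

```python
import itertools


def estimate_pass_rate(attempt, problem, n_samples=16):
    """Fraction of n_samples tries that attempt(problem) reports as correct."""
    return sum(1 for _ in range(n_samples) if attempt(problem)) / n_samples


def select_problems(problems, attempt, low=0.1, high=0.9, n_samples=16):
    """Keep problems the model solves sometimes, but not (almost) always."""
    kept = []
    for problem in problems:
        rate = estimate_pass_rate(attempt, problem, n_samples)
        if low <= rate <= high:
            kept.append(problem)
    return kept


# Stand-in for "sample the model and grade the answer" (purely illustrative).
_calls = itertools.count()


def toy_attempt(problem):
    n = next(_calls)
    if problem == "too_hard":
        return False       # never solved: no correct answer to reinforce
    if problem == "too_easy":
        return True        # always solved: nothing left to learn
    return n % 2 == 0      # solved on every other try


selected = select_problems(["too_hard", "medium", "too_easy"], toy_attempt)
print(selected)
```

The labor is in everything this skips over: writing the problems, grading answers reliably, and picking thresholds that suit the model.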