by diggan
644 days ago
I think the submitted paper is talking about reinforcement learning as part of, or after, the main training, after which the model does inference as normal. They might have done that for o1, but the bigger change is the "runtime train of thought": once the model receives the prompt, and before it gives a definitive answer, it "thinks" in words and readjusts at runtime. At least that's my understanding of these two approaches, and if that's true, then they're not similar. AFAIK, OpenAI has been doing reinforcement learning since the first version of ChatGPT and for every model since; that's why you can leave feedback in the UI in the first place.
> Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.
That's incredibly similar to this paper, which discusses the difficulty of finding a training method that guides the model to learn a self-correcting technique (in which subsequent attempts learn from and improve on previous attempts), instead of just "collapsing" into a mode of trying to get the answer right on the very first try.
[1]: https://openai.com/index/learning-to-reason-with-llms/