|
Some commenters seem a bit confused as to how this works. Here is my understanding, in the hope that it helps clarify things. Ask a model something and it will reply in one go, likely imperfectly, as if you had one second to think before answering a question. You can use CoT prompting to force it to reason out loud, which improves quality, but the process is still linear. It's as if you still had only one second before starting to answer but could take much longer over the response itself, which removes some mistakes. Now if, instead of doing that, you query the model once with CoT, then ask it or another model to critically assess the reply, then ask the model to improve on its first reply using that feedback, and keep doing that until the critic is satisfied, the output will be better still. Note that this is a feedback loop with multiple requests, which is of a different nature than CoT and much more akin to how a human would approach a complex problem. You can get MUCH better results that way, a good example being Code Interpreter. If classic LLM usage is system 1 thinking, this is system 2. That's how o1 works at test time, probably. For training, my guess is that they started from a model not that far from GPT-4o and fine-tuned it with RL using the above feedback loop, but this time converting the critic's verdict into a reward signal for an RL algorithm. That way, the model gets better at its first guess and needs less back and forth for the same output quality. As for the training data, I'm wondering if you can't somehow get infinite training data by just throwing random challenges at it, or very hard ones, and letting the model think about/train on them for a very long time (as long as the critic is unforgiving enough). |
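
To make the loop I'm describing concrete, here is a minimal Python sketch. It is only my illustration of the generate/critique/revise idea, not how o1 is actually implemented. The names `refine_with_critic`, `toy_generate`, `toy_critique` and `toy_revise` are all hypothetical; the toy functions are stand-ins for real model calls, so the example runs without any API.

```python
from typing import Callable, Tuple

# Hypothetical signatures for the three roles in the loop.
Generate = Callable[[str], str]                      # question -> first answer
Critique = Callable[[str, str], Tuple[bool, str]]    # (question, answer) -> (satisfied, feedback)
Revise = Callable[[str, str, str], str]              # (question, answer, feedback) -> new answer


def refine_with_critic(
    question: str,
    generate: Generate,
    critique: Critique,
    revise: Revise,
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Answer once, then alternate critique and revision until the critic
    is satisfied or max_rounds revisions have been made.

    Returns the final answer and the number of revisions performed.
    """
    answer = generate(question)
    for revisions in range(max_rounds):
        satisfied, feedback = critique(question, answer)
        if satisfied:
            return answer, revisions
        answer = revise(question, answer, feedback)
    return answer, max_rounds


# Toy stand-ins (assumptions for illustration only, not real model calls).
def toy_generate(question: str) -> str:
    # A quick "system 1" guess that is slightly off.
    return "50"


def toy_critique(question: str, answer: str) -> Tuple[bool, str]:
    # The critic happens to know the right answer to "What is 17 * 3?".
    if int(answer) == 51:
        return True, "looks right"
    return False, "too low" if int(answer) < 51 else "too high"


def toy_revise(question: str, answer: str, feedback: str) -> str:
    # Nudge the answer in the direction the critic indicated.
    step = 1 if feedback == "too low" else -1
    return str(int(answer) + step)


if __name__ == "__main__":
    print(refine_with_critic("What is 17 * 3?", toy_generate, toy_critique, toy_revise))
```

The second return value, the number of revisions, is the kind of quantity I'd imagine could feed a reward signal in the training setup I'm guessing at: fewer rounds for the same accepted answer means a better first guess.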