Hacker News
by utdiscant 639 days ago
Feels like a lot of commenters here miss the difference between just doing chain-of-thought prompting and what is happening here, which is learning a good chain-of-thought strategy using reinforcement learning.

"Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses."

When looking at the chain of thought (CoT) in the examples, you can see that the model employs different CoT strategies depending on which problem it is trying to solve.

3 comments

I'd be curious how this compares against "regular" CoT experiments. E.g. were the GPT-4o results done zero-shot, or was it asked to explain its solution step by step?
It was asked to explain step by step.
It’s basically a scaled-up Tree of Thoughts.
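For readers unfamiliar with the idea, here is a minimal sketch of a Tree-of-Thoughts-style search on a toy arithmetic puzzle. It is only an illustration, not how o1 works: in the real technique a language model proposes and evaluates the intermediate "thoughts", whereas the `propose` and `score` functions below are hypothetical hand-written stand-ins, and the puzzle (reach a target number using "+3" and "*2") is an assumption chosen to keep the example runnable.

```python
from typing import List, Optional, Tuple

# A state is (current value, steps taken so far).
State = Tuple[int, Tuple[str, ...]]


def propose(state: State) -> List[State]:
    """Hypothetical thought generator: expand a state into its next steps."""
    value, steps = state
    return [
        (value + 3, steps + ("+3",)),
        (value * 2, steps + ("*2",)),
    ]


def score(state: State, target: int) -> float:
    """Hypothetical evaluator: closer to the target is better."""
    return -abs(target - state[0])


def tree_of_thoughts(start: int, target: int,
                     beam_width: int = 2, max_depth: int = 6) -> Optional[State]:
    """Breadth-first search that keeps only the best `beam_width` states per level."""
    frontier: List[State] = [(start, ())]
    for _ in range(max_depth):
        # Expand every surviving partial solution by one step.
        candidates = [child for s in frontier for child in propose(s)]
        for candidate in candidates:
            if candidate[0] == target:
                return candidate
        # Prune: keep only the most promising partial solutions.
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        frontier = candidates[:beam_width]
    return None


result = tree_of_thoughts(start=1, target=14)
print(result)
```

The pruning step is the part that distinguishes this from a single linear chain of thought: several partial lines of reasoning are kept alive and compared before committing to one.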
In the primary CoT research paper they discuss figuring out how to train models using formal languages instead of just natural ones. I'm guessing this is one piece to the model learning tree-like reasoning.

Based on some quick searching, it seems like they are using RL to provide positive/negative feedback on which "paths" to choose when performing CoT.
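As a rough illustration of what "positive/negative feedback on paths" could mean, here is a minimal sketch of a policy-gradient (gradient bandit) update over a handful of reasoning paths. Everything in it is an assumption made for the example: the three paths, their success rates, and the 0/1 reward are invented, and nothing here is confirmed to match what OpenAI actually does.

```python
import math
import random
from typing import List


def softmax(prefs: List[float]) -> List[float]:
    """Turn preference scores into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_path_policy(success_rates: List[float], steps: int = 3000,
                      lr: float = 0.1, seed: int = 0) -> List[float]:
    """Learn which hypothetical reasoning path to prefer from 0/1 rewards."""
    rng = random.Random(seed)
    prefs = [0.0] * len(success_rates)
    baseline = 0.0  # running average reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        # Sample a path from the current policy.
        path = rng.choices(range(len(prefs)), weights=probs)[0]
        # Reward 1 if the path "solved" the problem, else 0 (simulated).
        reward = 1.0 if rng.random() < success_rates[path] else 0.0
        advantage = reward - baseline
        # Raise the preference of the chosen path when it beats the baseline,
        # lower it otherwise; the other paths move in the opposite direction.
        for i in range(len(prefs)):
            grad = (1.0 - probs[i]) if i == path else -probs[i]
            prefs[i] += lr * advantage * grad
        baseline += (reward - baseline) / t
    return softmax(prefs)


# Assumed success rates for three made-up paths; the last is the best one.
final_probs = train_path_policy([0.1, 0.3, 0.9])
print(final_probs)
```

In this toy setup the policy should end up putting most of its probability on the path that succeeds most often, which is the basic sense in which reward feedback can shape the choice of path.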

This seems most likely, with some special tokens thrown in to kick off different streams of thought.
To me it looks like they paired two instances of the model to feed off of each other's outputs with some sort of "contribute to reasoning out this problem" prompt. In the prior demos of 4o they did several similar demonstrations of that with audio.
To create the training data? Almost certainly something like that (likely more than two), but I think they then trained on the synthetic data created by this "conversation". There is no reason a model can't learn to do all of that, especially if you insert special tokens (like think, reflect, etc., which have already been shown to be useful).
No, I'm referring to how the chain of thought transcript seems like the output of two instances talking to each other.
Right - I don't think it's doing that. I think it has likely been fine-tuned to transition between roles. But maybe you are right.
Reminds me of how Google's AlphaGo learned to play the best Go that was ever seen. And this seems to be somewhat of a generalization of that.