| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by utdiscant 686 days ago

Feels like a lot of commenters here miss the difference between just doing chain-of-thought prompting, and what is happening here, which is learning a good chain of thought strategy using reinforcement learning.

"Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses."

When looking at the chain of thought (COT) in the examples, you can see that the model employs different COT strategies depending on which problem it is trying to solve.

3 comments

persedes 686 days ago

I'd be curious how this compared against "regular" CoT experiments. E.g. were the gpt4o results done with zero shot or was it asked to explain it's solution step by step.

link

nmca 685 days ago

It was asked to explain step by step.

link

mountainriver 685 days ago

It’s basically a scaled Tree of Thoughts

link

qudat 685 days ago

In the primary CoT research paper they discuss figuring out how to train models using formal languages instead of just natural ones. I'm guessing this is one piece to the model learning tree-like reasoning.

Based on the quick searching it seems like they are using RL to provide positive/negative feedback on which "paths" to choose when performing CoT.

link

danielmarkbruce 685 days ago

This seems most likely, with some special tokens thrown in to kick off different streams of thought.

link

Zenzero 685 days ago

To me it looks like they paired two instances of the model to feed off of each other's outputs with some sort of "contribute to reasoning out this problem" prompt. In the prior demos of 4o they did several similar demonstrations of that with audio.

link

danielmarkbruce 685 days ago

To create the training data? Almost certainly something like that (likely more than two), but I think they then trained on the synthetic data created by this "conversation". There is no reason a model can't learn to do all of that, especially if you insert special tokens (like think, reflect etc that have already shown to be useful)

link

Zenzero 685 days ago

No I'm referring to how the chain of thought transcript seems like the output of two instances talking to each other.

link

danielmarkbruce 685 days ago

Right - i don't think it's doing that. I think it has likely been fine tuned to transition between roles. But, maybe you are right.

link

diedyesterday 684 days ago

Reminds me of how Google's AlphaGo learned to play the best Go that was ever seen. And this somewhat seems a generalization of that.

link