by utdiscant
639 days ago
Feels like a lot of commenters here miss the difference between plain chain-of-thought prompting and what is happening here, which is learning a good chain-of-thought strategy using reinforcement learning. "Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses." When looking at the chain of thought (CoT) in the examples, you can see that the model employs different CoT strategies depending on which problem it is trying to solve.