|
|
|
|
|
by piecerough
506 days ago
|
|
I think the reason why it works is also because chain-of-thought (CoT), in the original paper by Denny Zhou et. al, worked from "within". The observation was that if you do CoT, answers get better. Later on community did SFT on such chain of thoughts. Arguably, R1 shows that was a side distraction, and instead a clean RL reward would've been better suited. |
|