

	by piecerough 506 days ago
	I think the reason why it works is also because chain-of-thought (CoT), in the original paper by Denny Zhou et al., worked from "within". The observation was that if you do CoT, answers get better. Later on, the community did SFT on such chains of thought. Arguably, R1 shows that was a side distraction, and that a clean RL reward would have been better suited.


singularity2001 506 days ago

One big question will be whether chain of thought within the embedding space will work better than in the token space.


kevinventullo 505 days ago

This recent paper is relevant: https://arxiv.org/abs/2412.06769


robrenaud 506 days ago

Do you understand why RL is better than SFT for training on reasoning traces?


pama 506 days ago

I always assumed the reason is that you are working with the pretrained model rather than against it. Whatever “logic” rules or functions the model came up with to compress (make more sense of) the vast amounts of pretraining data, it then uses the same functions during RL. Of course, distillation from a strong, huge model might still help more than RL directly applied on the small model because the strong model came up with much better functions/reasoning during pretraining, which the small model can simply copy. These models all learn in different ways than most humans, so human-based SFT can only go so far.


piecerough 505 days ago

SFT forces the model to output _that_ reasoning trace you have in the data. RL allows whatever reasoning trace and only penalizes it if it does not reach the same answer.
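
A minimal sketch of that contrast, under stated assumptions: `sft_loss` and `outcome_reward` are hypothetical helper names, the `Answer:` marker is an invented convention for locating the final answer, and the per-token probabilities are made-up numbers rather than output from any real model. The SFT signal is the likelihood of the one reference trace, while the outcome reward looks only at the final answer of whatever trace was sampled.

```python
import math

def sft_loss(ref_token_probs):
    """Mean negative log-likelihood of the reference trace.

    ref_token_probs: probability the model assigns to each token of the
    single reference trace in the dataset (illustrative numbers only).
    """
    return -sum(math.log(p) for p in ref_token_probs) / len(ref_token_probs)

def outcome_reward(sampled_output, reference_answer):
    """Return 1.0 if the final answer matches, whatever the trace was.

    Assumes (hypothetically) that outputs end with 'Answer: <value>'.
    """
    final_answer = sampled_output.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if final_answer == reference_answer else 0.0

# Two different reasoning traces that reach the same answer, one that does not.
trace_a = "17 + 20 = 37, then 37 + 5 = 42. Answer: 42"
trace_b = "10 + 20 = 30, 7 + 5 = 12, 30 + 12 = 42. Answer: 42"
trace_c = "17 + 20 = 37, then 37 + 5 = 43. Answer: 43"

# RL-style signal: both correct traces are treated the same.
rewards = [outcome_reward(t, "42") for t in (trace_a, trace_b, trace_c)]

# SFT-style signal: a model that prefers a different trace than the
# reference gets a high loss, even if its own trace would be correct.
loss_when_model_agrees = sft_loss([0.9, 0.9, 0.9])
loss_when_model_differs = sft_loss([0.1, 0.1, 0.1])
```

Here `trace_a` and `trace_b` earn the same reward even though they share few tokens, whereas the SFT loss grows whenever the model's preferred tokens diverge from the reference trace. Real RL setups add more on top of this (sampling, baselines, KL penalties), which this sketch leaves out.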

link