| HN Mirror



	by chaeronanaut 484 days ago
	> The words that are coming out of the model are generated to optimize for RLHF and closeness to the training data, that's it!

	This is false. Reasoning models are rewarded/punished based on performance at verifiable tasks, not human feedback or next-token prediction.

1 comments

Xelynega 484 days ago

How does that differ from a non-reasoning model rewarded/punished based on performance at verifiable tasks?

What does CoT add that enables the reward/punishment?


Jensson 484 days ago

Without CoT, training them to give specific answers reduces performance. With CoT, you can punish them for not giving the exact answer you want without hurting performance, since the reasoning tokens help them figure out how to answer the question and what the answer should be.

And you really want to train on specific answers, since that makes it easy to tell whether the AI was right or wrong. So for now, hidden CoT is the only working way to train them for accuracy.
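
To make the "easy to tell if the AI was right or wrong" part concrete, here is a rough sketch of an outcome-only reward, not anyone's actual training code. The `Answer:` marker and both function names are made up for illustration. The point is that the check only looks at the final answer, so the reasoning tokens before it are free to be whatever helps.

```python
def extract_final_answer(completion: str, marker: str = "Answer:") -> str:
    """Return the text after the last marker, or "" if the marker is absent.

    The "Answer:" convention is an assumption made for this sketch.
    """
    _, found, tail = completion.rpartition(marker)
    return tail.strip() if found else ""


def verifiable_reward(completion: str, expected: str) -> float:
    """Score 1.0 on an exact match of the final answer, 0.0 otherwise.

    Everything before the marker (the chain of thought) is never graded.
    """
    return 1.0 if extract_final_answer(completion) == expected.strip() else 0.0


if __name__ == "__main__":
    with_cot = "2 + 2: two pairs make four.\nAnswer: 4"
    print(verifiable_reward(with_cot, "4"))
    print(verifiable_reward("Answer: 5", "4"))
```

Two completions with completely different reasoning get the same score as long as the final answer matches, which is why grading on exact answers doesn't constrain the reasoning itself.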
