by chaeronanaut
438 days ago
> The words that are coming out of the model are generated to optimize for RLHF and closeness to the training data, that's it!

This is false. Reasoning models are rewarded/punished based on performance at verifiable tasks, not human feedback or next-token prediction.
|
What does CoT add that enables the reward/punishment?