Y
Hacker News
new
|
ask
|
show
|
jobs
by
piecerough
505 days ago
SFT forces the model to output _that_ reasoning trace you have in data. RL allows whatever reasoning trace and only penalizes it if it does not reach the same answer