Hacker News
by
zby
503 days ago
I think this is where the relative rewards come into play: they sample many thinking traces for the same problem and reward those that are correct. This works at the current 'cutting edge' for the model, exactly where it could be improved.
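To make the "relative" part concrete, here is a minimal sketch of one way it could work. It assumes a group-normalized scheme where each trace's reward is compared to the mean of its own group; the function name, the 0/1 grading, and the `eps` constant are my own illustrative choices, not something taken from a specific paper or library.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled trace relative to the other traces in its group.

    Hypothetical helper: subtract the group mean and divide by the group
    standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Assumed grading: 1.0 if the trace reached the correct answer, else 0.0.
too_easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])     # every trace correct
too_hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])     # every trace wrong
at_the_edge = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # mixed outcomes

print(too_easy)     # all zeros: nothing to learn from
print(too_hard)     # all zeros: nothing to learn from
print(at_the_edge)  # positive for correct traces, negative for wrong ones
```

Under these assumptions, problems the model always solves or never solves contribute no signal, and only problems with mixed outcomes push the model in any direction, which is the 'cutting edge' effect described above.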