Hacker News
by AlexCoventry 205 days ago
"Process-oriented" verification has been a thing for a while in chain-of-thought (CoT) mathematical reasoning. Google had a paper about it last year [1]. The key term to search for is "process reward model" (PRM). I particularly like RL Tango [2].
[1] https://arxiv.org/abs/2406.06592
[2] https://arxiv.org/abs/2505.15034
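For anyone unfamiliar with the term, here is a rough toy sketch of the difference between checking only the final answer and checking each reasoning step. It is not taken from either paper: `toy_step_scorer` is a hypothetical stand-in for a learned process reward model, and aggregating with `min` is just one common choice (a product over step scores is another).

```python
from typing import Callable, List


def outcome_reward(final_answer: str, reference: str) -> float:
    """Outcome-level check: only the final answer is compared."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


def process_reward(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """Process-level check: score every step, then aggregate with min."""
    if not steps:
        return 0.0
    return min(step_scorer(step) for step in steps)


def toy_step_scorer(step: str) -> float:
    """Hypothetical stand-in for a learned PRM: verifies 'a + b = c' steps."""
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return 1.0 if int(a) + int(b) == int(rhs) else 0.0


# The final answer is right, but the first step contains an arithmetic error.
steps = ["2 + 3 = 6", "5 + 5 = 10"]
print(outcome_reward("10", "10"))              # outcome check sees no problem
print(process_reward(steps, toy_step_scorer))  # process check flags the bad step
```

The outcome check rewards this chain because the last number matches, while the process check penalizes it because one intermediate step is wrong.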