by pas 302 days ago
sorry, could you explain why it is harder, and where the complexity creeps in (compared to some naive "pattern matching the end of the response" tactic)? thanks!
2 comments

The pattern matching compares what was said against an example of what a correct response could say.

Checking a token at a time evaluates whether the response so far is heading toward a correct final answer. The intermediate text can be whatever it needs to be to arrive at that answer, but training at the per-token level means training the very tokens where you want to give the model leeway. It also needs another model to adjudicate how well things are going from incomplete answers.
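To make the difference concrete, here is a minimal sketch of how reward could be assigned in the two setups. Everything in it is an assumption for illustration: `outcome_rewards`, `process_rewards`, and `toy_adjudicator` are hypothetical names, and the toy adjudicator is a trivial stand-in for what would really be a separately trained model.

```python
from typing import Callable, List


def outcome_rewards(tokens: List[str], is_correct: bool) -> List[float]:
    """Outcome supervision: only the end of the response is scored."""
    if not tokens:
        return []
    # Every intermediate token gets 0; the final token carries the verdict.
    return [0.0] * (len(tokens) - 1) + [1.0 if is_correct else 0.0]


def process_rewards(
    tokens: List[str], adjudicator: Callable[[List[str]], float]
) -> List[float]:
    """Process supervision: an adjudicator scores every incomplete prefix."""
    return [adjudicator(tokens[: i + 1]) for i in range(len(tokens))]


def toy_adjudicator(prefix: List[str], reference: str = "4") -> float:
    # Hypothetical stand-in for a learned model that knows the answer:
    # fully confident once the reference token shows up, neutral before that.
    return 1.0 if reference in prefix else 0.5


tokens = ["2", "+", "2", "=", "4"]
print(outcome_rewards(tokens, is_correct=True))
print(process_rewards(tokens, toy_adjudicator))
```

The outcome version needs nothing beyond a final check, while the process version is only as good as whatever `adjudicator` you plug in, which is where the extra complexity comes from.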

I'm not sure how much the adjudicator relies on knowing the final answer versus judging the quality of the reasoning of the model being trained. I'd be inclined to train two adjudicators, one that knows the answers and one that doesn't. I'm sure there would be interesting things to see in their differential signal.

Just speculating, but proximity to a reference answer is a much denser reward signal. In contrast, parsing out a final answer into a pass/fail only provides a sparse reward signal.
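A rough sketch of the two kinds of signal, under some assumptions of mine: the response is assumed to end with a line like "Answer: ...", and character-level similarity from Python's `difflib` stands in for "proximity to a reference answer". The function names `sparse_reward` and `dense_reward` are made up for this example, not taken from any training setup.

```python
import re
from difflib import SequenceMatcher


def sparse_reward(response: str, reference_answer: str) -> float:
    """Pass/fail: parse out the final answer and compare it exactly."""
    # Assumed format: the response ends with "Answer: <value>".
    match = re.search(r"Answer:\s*(.+?)\s*$", response)
    return 1.0 if match and match.group(1) == reference_answer else 0.0


def dense_reward(response: str, reference_response: str) -> float:
    """Proximity: graded similarity to a full reference response, in [0, 1]."""
    return SequenceMatcher(None, response, reference_response).ratio()


reference = "Answer: 42"
for response in ["Answer: 42", "Answer: 41", "no idea"]:
    print(response, sparse_reward(response, "42"), dense_reward(response, reference))
```

The near miss and the unrelated response both get 0 from the pass/fail check, but the similarity score separates them, which is what makes it denser. The flip side is that it rewards looking like the reference rather than being right.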
Yup, RLVR as implemented by Deepseek et al. uses only outcome supervision instead of process supervision. There have been attempts to do process supervision though.