by dimitry12
549 days ago
The verifier is trained on soft reward-to-go values for each solution prefix, obtained from Monte Carlo rollouts of step-by-step solutions sampled from the "base" model. In other words: 1) sample step-by-step solutions from the "base" model; 2) do it at non-zero temperature so that you get multiple continuations from each solution prefix; 3) use the MATH labels to decide whether each full solution (a leaf/terminal node in the MC rollout) gets reward `1` or `0`; 4) roll these rewards up to calculate the reward-to-go for each intermediate step. Yes, a verifier trained in this manner can be used to score solution prefixes (as a process verifier) or a full solution (as an outcome verifier). In the original paper (https://arxiv.org/abs/2408.03314) they fine-tune a fresh verifier. HF's replication uses an off-the-shelf verifier based on another paper: https://arxiv.org/abs/2312.08935
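To make step 4 concrete, here is a rough sketch of the roll-up, not the code from either paper. It assumes each sampled solution is stored as a tuple of step strings plus a 0/1 label, and it estimates a prefix's reward-to-go as the mean final reward over all sampled solutions that pass through that prefix. The name `soft_values` and the toy data are made up for illustration; the papers may sample fresh continuations per prefix rather than pooling like this.

```python
from collections import defaultdict

def soft_values(rollouts):
    """Estimate reward-to-go for every solution prefix.

    rollouts: iterable of (steps, reward), where steps is a sequence of
    step strings and reward is 1 (final answer correct) or 0 (incorrect).
    Returns {prefix_tuple: mean final reward of rollouts through it}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for steps, reward in rollouts:
        # Credit the terminal reward to every prefix of this solution.
        for k in range(1, len(steps) + 1):
            prefix = tuple(steps[:k])
            totals[prefix] += reward
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}

# Toy data (hypothetical): four sampled solutions to one problem.
rollouts = [
    (("A", "B", "C"), 1),
    (("A", "B", "D"), 0),
    (("A", "E", "F"), 0),
    (("A", "B", "C"), 1),
]
values = soft_values(rollouts)
for prefix, value in sorted(values.items()):
    print(prefix, round(value, 3))
```

The resulting (prefix, value) pairs are the soft training targets: a full-length prefix gives you the outcome-verifier signal, a partial one the process-verifier signal.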