by iceman_w
417 days ago
RL constrains the space of possible output token sequences to those that are likely to lead to the correct answer, so we are inherently trading away variance for reliability. A non-RL model has higher variance, so given enough attempts it will come up with some correct answers that an RL model can't.
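To put rough numbers on that trade-off, here is a minimal sketch using the standard "at least one of k samples is correct" formula, 1 - (1 - p)^k. The per-problem success probabilities are made up for illustration (not measurements of any real model), and samples are assumed to be independent. The names `base_model` and `rl_model` are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct,
    given a per-sample success probability p."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(probs: list[float], k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in probs) / len(probs)


# Hypothetical per-problem success probabilities (assumed, not measured).
# The base model is unreliable everywhere but never at exactly zero.
base_model = [0.30, 0.30, 0.05]
# The RL model is very reliable on two problems, but its narrowed output
# distribution puts no probability on a correct answer for the third.
rl_model = [0.90, 0.90, 0.00]

for k in (1, 4, 16, 64, 256):
    base = mean_pass_at_k(base_model, k)
    rl = mean_pass_at_k(rl_model, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

With these assumed numbers the RL model wins clearly at k=1, but it can never get past 2/3 because no number of retries helps on the problem where its success probability is zero. The base model keeps climbing toward 1 as k grows.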