by spwa4
424 days ago
I don't like papers that ask a question in the title, so here's the answer: "RL boosts sampling efficiency but reduces the reasoning capacity boundary." Perhaps better put: given one or a few attempts, RL-trained models beat non-RL models. Given many attempts, non-RL models come up with better answers.
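To make the "few attempts vs. many attempts" point concrete, here is a toy sketch of the usual pass@k idea: the chance that at least one of k independent samples solves a problem. The per-problem success rates below are invented purely for illustration (they are not from the paper), and `pass_at_k`, `rl_model`, and `base_model` are hypothetical names.

```python
def pass_at_k(success_probs, k):
    """Average chance that at least one of k independent samples solves each problem."""
    return sum(1 - (1 - p) ** k for p in success_probs) / len(success_probs)

# Hypothetical per-problem success rates, made up for illustration only.
# "RL" model: very reliable on 6 of 10 problems, never solves the other 4.
rl_model = [0.9] * 6 + [0.0] * 4
# "Base" model: rarely right on any single try, but has a nonzero chance on 9 of 10.
base_model = [0.05] * 9 + [0.0] * 1

for k in (1, 8, 256):
    print(k, round(pass_at_k(rl_model, k), 3), round(pass_at_k(base_model, k), 3))
```

With these made-up numbers the RL-style model is ahead at k=1, and the base-style model overtakes it once k is large, because its set of problems with nonzero success probability is wider.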