| RL is proving to be a weird science lately:
>Spurious Rewards: Rethinking Training Signals in RLVR
### *TL;DR*
We show that you can do RLVR on Qwen2.5-Math models with *completely random or incorrect rewards*, and still get massive math benchmark gains. All of the following spurious rewards give 15-20+ points on MATH-500 when RLVR training Qwen2.5-Math-7B:
- RLVR + format reward (reward responses with `\boxed{}`): *+16.4%*
- RLVR + incorrect reward (only incorrect answers rewarded): *+24.6%*
- RLVR + random reward: *+21.4%*
- (as a reference) RLVR + ground-truth reward: +28.8%
How can these spurious rewards possibly work? Can we get similar gains on other models with broken rewards?
>Learning to Reason without External Rewards
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. [2]
[1] https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking...
[2] https://arxiv.org/abs/2505.19590 |
So the shifting reward values may act as a sort of unintentional regularization technique (similar to adding noise to the discriminator input in GAN architectures).
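To make the "noise" intuition concrete, here is a minimal sketch of what a random reward looks like after GRPO-style group normalization. It assumes the commonly described advantage `(r - mean) / std` computed within one group of sampled responses. The function names, the toy group of 8 responses, and the Bernoulli(0.5) reward are all hypothetical, and this is not code from either paper. It only shows that the random-reward advantages are nonzero on any given step but average out to roughly zero per response; it does not by itself explain the reported gains.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def random_rewards(group_size, rng, p=0.5):
    """Hypothetical spurious reward: 1 with probability p, ignoring the answer."""
    return [1.0 if rng.random() < p else 0.0 for _ in range(group_size)]


rng = random.Random(0)

# Hypothetical correctness labels for one group of 8 sampled responses.
correct = [1, 0, 0, 1, 0, 0, 0, 1]

# Ground-truth reward: correct responses get a positive advantage.
truth_adv = group_relative_advantages([float(c) for c in correct])

# Random reward: advantages are nonzero but unrelated to correctness.
noise_adv = group_relative_advantages(random_rewards(len(correct), rng))

# Averaged over many steps, the random-reward advantage of any single
# response is close to zero, i.e. it behaves like zero-mean noise.
trials = 20000
mean_noise_adv_first = statistics.fmean(
    group_relative_advantages(random_rewards(len(correct), rng))[0]
    for _ in range(trials)
)

print("ground-truth advantages:", [round(a, 2) for a in truth_adv])
print("random-reward advantages:", [round(a, 2) for a in noise_adv])
print("mean random advantage for response 0:", round(mean_noise_adv_first, 3))
```

Whether that zero-mean noise ends up acting as a regularizer or interacts with other parts of the update is exactly the open question; the sketch only pins down what signal the optimizer sees.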