| So to my understanding, this work reproduces DeepSeek R1's reinforcement learning mechanism in a very small language model. The model gets "rewards" (like points) for doing two things correctly. Accuracy: getting the right answer. For example, math answers must be in a specific format (e.g., inside a box) so a program can easily check them, and for coding problems, test cases verify whether the code works. Format: using the <think> and <answer> tags properly, which forces the model to organize its responses clearly. So in this case, the training program can extract the model's answer by parsing the <answer> tag and check whether it is correct. If it is correct, give a reward; otherwise, no reward. Sample N answers to a single question and you get an array of N rewards. This is enough for the RL algorithm to guide the model toward better answers. |
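To make the reward scheme above concrete, here is a minimal sketch in Python. It is not the project's actual code: the function names, the exact-string comparison, the 0/1 reward values, and the choice to add the two rewards together are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; a real training script may be stricter or looser.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)


def extract_answer(completion):
    """Return the text inside the first <answer> tag, or None if absent."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else None


def format_reward(completion):
    """1.0 if the completion is exactly <think>...</think><answer>...</answer>."""
    return 1.0 if FORMAT_RE.fullmatch(completion) else 0.0


def accuracy_reward(completion, expected):
    """1.0 if the extracted answer equals the expected string (assumed check)."""
    answer = extract_answer(completion)
    return 1.0 if answer is not None and answer == expected.strip() else 0.0


def score_group(completions, expected):
    """Score N sampled completions for one question: one reward per sample."""
    return [accuracy_reward(c, expected) + format_reward(c) for c in completions]


completions = [
    "<think>2+2=4</think><answer>4</answer>",  # correct and well formatted
    "<think>hmm</think><answer>5</answer>",    # well formatted, wrong answer
    "4",                                       # no tags at all
    "<answer> 4 </answer>",                    # correct, but missing <think>
]
rewards = score_group(completions, "4")
```

For math or code tasks the equality check would be swapped for a proper verifier (a boxed-answer parser, a test runner), but the shape stays the same: N completions in, N scalar rewards out.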
Instead, DeepSeek (with GRPO) seems to omit that value function entirely and use only sparse rewards. How does this end up being more efficient? I thought the sparse nature of the rewards makes it harder to converge to the optimal policy.
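For context on what stands in for the value function: as far as I understand GRPO, each answer's reward is compared against the other answers sampled for the same question, so the group mean acts as the baseline. Below is a rough sketch of that normalization only, not of the full algorithm. The function name, the `eps` term, and the use of the population standard deviation are my assumptions; implementations may differ on those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same question.

    The group mean plays the role of the baseline that a learned value
    function would otherwise provide. eps (an assumed detail) avoids
    division by zero when every sample received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two of four samples were correct: they get positive advantages,
# and the two incorrect ones get negative advantages.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If all samples score the same, there is no learning signal for this group.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Note that when every sample in a group gets the same reward the advantages are all zero, which is one way the sparsity concern in the question shows up in practice.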