by spwa4 424 days ago

I don't like papers that ask a question in the title, so here's the answer:

"RL boosts sampling efficiency but reduces the reasoning capacity boundary."

Perhaps better to put it like this: given one or a few attempts, RL-trained models beat non-RL models. Given many attempts, non-RL models come up with better answers.
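
To make that crossover concrete, here is a toy sketch. The per-problem success probabilities are invented for illustration and are not taken from the paper: a hypothetical base model solves every problem with a small per-sample chance, while a hypothetical RL-tuned model is much more reliable on most problems but has lost a few entirely.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(per_problem_p: list[float], k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in per_problem_p) / len(per_problem_p)


# Hypothetical per-sample success probabilities on 10 problems (assumed values).
base_model = [0.05] * 10            # rarely right, but can reach every problem
rl_model = [0.60] * 7 + [0.0] * 3   # usually right on 7 problems, never on 3

for k in (1, 4, 16, 64, 256):
    base = mean_pass_at_k(base_model, k)
    rl = mean_pass_at_k(rl_model, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

With these made-up numbers the RL model wins at small k and plateaus at 0.7, while the base model keeps climbing as k grows. Whether real models behave this way is exactly what the paper measures.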

sitkack 424 days ago

My gut feeling when using DeepSeek is that its performance is a lot smoother, the responses feel more robust and not as brittle.

cma 424 days ago

At least with DeepSeekMath (which uses the same RL technique as the later R1), they noted similar things in the "Why RL Works?" section of their paper. Yannic Kilcher's video review of the DeepSeekMath paper goes over that section starting at around the 1:04:00 mark and points to basically the same limitations as the HN submission paper, ending with this:

    [1:05:40 to 1:06:09]
    the improvement is attributed to boosting the correct response from Top K rather than the enhancement of fundamental capabilities. This is something that we've come to learn in a lot of different ways from, like, reinforcement learning on language models or even supervised fine-tuning: what's happening most likely is more that the capabilities of doing all of these things are already present in the underlying pre-trained language model

https://www.youtube.com/watch?v=bAWV_yrqx4w&t=1h4m

from the paper:

> 5.2.2. Why RL Works?
>
> In this paper, we conduct reinforcement learning based on a subset of instruction tuning data, and it achieves significant performance enhancement upon the instruction tuning model. To further explain why reinforcement learning works. We evaluate the Pass@K and Maj@K accuracy of the Instruct and RL models on two benchmarks. As shown in Figure 7, RL enhances Maj@K’s performance but not Pass@K. These findings indicate that RL enhances the model’s overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities. Similarly, (Wang et al., 2023a) identified a misalignment problem in reasoning tasks within the SFT model, showing that the reasoning performance of SFT models can be improved through a series of preference alignment strategies (Song et al., 2023; Wang et al., 2023a; Yuan et al., 2023b).

In the video he reads into this that these methods alone may not get us over the data wall at all and are still fundamentally limited by the distribution of the base model they augment.

GloamingNiblets 424 days ago

Thanks for sharing. I had trouble reading the transcript, so here is Claude's cleaned-up version and summary:

Here's the condensed and formatted transcription in a single paragraph:

This is the last thing I want to highlight this section on why RL works. Here they evaluate different things - they evaluate specifically pass at K and maj at K. Maj at K is like majority voting, so what you do is you have a model, you have a question, and you output not just one output but an ordered set. So you give your top 20 answers - 0 is your best answer that the model wants to give most, then the second most answer, third most answer, and so on. They could all be correct, just different reformulations of the same answer or different derivations stated in different ways. What you're interested in is how many of the top K results are correct - that's the pass at K. And if you had to vote if majority voting on the top K, how often would you be correct then? There's a slight difference, and that slight difference is actually made more drastic by reinforcement learning. They say, "As shown in figure 7, reinforcement learning enhances majority at K performance but not pass at K." These findings indicate that reinforcement learning enhances the model's overall performance by rendering the output distribution more robust. In other words, it seems that the improvement is attributed to boosting the correct response from Top K rather than the enhancement of fundamental capabilities. This is something we've come to learn in many different ways from reinforcement learning on language models or even supervised fine-tuning - what's happening most likely is that the capabilities of doing all of these things are already present in the underlying pre-trained language model.

Summary: Reinforcement learning improves language model performance not by enhancing fundamental capabilities but by making the output distribution more robust, effectively boosting correct responses within the top results rather than improving the model's inherent abilities.

spwa4 423 days ago

Just don't.

This is a horrible summary. It is both too complex and too simple at the same time. This summary spends about half its time talking about pass@k while failing to explain what it is and giving a great deal of good-sounding but misleading statements, making me think Claude completely misunderstood (it is absolutely not like majority voting). Pass@k means you get k attempts to answer a question. Get it right on any attempt and you pass. Get it wrong and, well, you have k (for example, 10) attempts in total.

The paper itself is much better. Hell, the conclusion of the paper is so much better than what you have here.

Here's a decent summary, directly from the paper's conclusion:

1. RL-trained models perform worse than base models in pass@k at large k values. (note that Claude's explanation of what pass@k is in the parent post is extremely wrong)

2. RL boosts sampling efficiency but reduces the reasoning capacity boundary.

3. RLVR algorithms perform similarly and remain far from optimal.

4. RLVR and distillation are fundamentally different.

And here's a one-line summary from me:

This paper claims that RL(VR) training is like improving the model's search algorithm: it becomes (a lot) better at locating a good answer within the model, but also pushes the model too hard to give only this answer.

Before Claude makes another absurd claim: RL = reinforcement learning (used, for example, for safety: say you try to get the model to explain how to break into a car; if it ever does, that's bad). RLVR = reinforcement learning with verifiable rewards (meaning only the correctness of the final answer is rewarded; the model gets to think as much as it wants before giving that answer, and the thinking does not have to be relevant).

And a comment: this is exactly what you'd expect to see from mild overtraining of the model. It could be that the current big players are pushing the models to be right/helpful/safe too hard, and taking away too much "freedom" in the process.
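
As a footnote to the pass@k definition above, here is a minimal sketch of how pass@k and maj@k differ when computed from the same samples. The pass@k function uses the unbiased estimator popularized by the Codex paper (an addition here, not something from the thread). The sample answers are made up, and `maj_at_k` is a hypothetical helper name.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def maj_at_k(answers: list[str], reference: str) -> bool:
    """Majority vote: the most frequent sampled answer must match the reference."""
    winner, _count = Counter(answers).most_common(1)[0]
    return winner == reference


# Made-up final answers from 8 samples on one question (assumed data).
samples = ["42", "41", "42", "17", "41", "41", "40", "41"]
reference = "42"

n = len(samples)
c = samples.count(reference)
print("pass@1:", pass_at_k(n, c, 1))
print("pass@8:", pass_at_k(n, c, 8))
print("maj@8 :", maj_at_k(samples, reference))
```

Here the correct answer shows up among the samples, so pass@8 succeeds, but a wrong answer is more frequent, so the majority vote fails. That gap is what the quoted "Why RL Works?" section is about.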

GloamingNiblets 423 days ago

I appreciate the feedback; it's another reminder not to lean too much on LLMs.

mountainriver 423 days ago

This also seems to be why rejection sampling + SFT is just as good, if not better, in a lot of scenarios.
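
For anyone unfamiliar with the idea, a rough sketch of the rejection sampling half follows. The `generate` and `verify` callables are hypothetical stand-ins for a model and an answer checker, and the arithmetic task is a toy chosen so the example runs on its own. The kept pairs would then be used as ordinary SFT data.

```python
import random
from typing import Callable


def rejection_sample(
    prompts: list[str],
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    k: int = 8,
) -> list[tuple[str, str]]:
    """Sample up to k completions per prompt and keep the first verified one."""
    kept = []
    for prompt in prompts:
        for _ in range(k):
            completion = generate(prompt)
            if verify(prompt, completion):
                kept.append((prompt, completion))
                break  # one verified sample per prompt is enough here
    return kept


# Toy stand-ins (assumptions): a noisy "model" for sums and an exact checker.
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    a, b = map(int, prompt.split("+"))
    return str(a + b + rng.choice([-1, 0, 0, 1]))  # sometimes off by one


def toy_verify(prompt: str, completion: str) -> bool:
    a, b = map(int, prompt.split("+"))
    return int(completion) == a + b


sft_data = rejection_sample(["2+3", "10+7"], toy_generate, toy_verify)
print(sft_data)
```

Prompts the sampler never solves within k tries simply drop out of the dataset, which is one way this approach stays bounded by what the base model can already produce.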

whatshisface 424 days ago

I don't know a lot about this but it seems like if the sampling performance was adequate, external checks like theorem verification would work to get "over the data wall."

cma 423 days ago

There have already been good results there with DeepMind's math Olympiad work. I think the LLM portion there was only for translating from informal to formal in the training process, and in the final process they still used a manual translation to a formal description. The solver was transformer based and RL trained, I think without starting from any language base, but it was able to learn some distribution helpful in solving the problems with RL, a verifier, and light scaffolding of the tree search alone.

cma 424 days ago

I'm pretty sure RL causes catastrophic forgetting of its base knowledge and that's why o3 hallucinates so much more.

If you mess around with trained weights you're going to delete some base knowledge, at least the knowledge that is outside of the tasks you RL on.

riku_iki 423 days ago

A solution could be to mix RL training with foundational knowledge training, so the LLM can refresh its memory and not forget things.

zaptrem 423 days ago

I think when they were figuring out RLHF they avoided this by interleaving RLHF with normal cross-entropy gradients on the training set.

kadushka 424 days ago

Hallucinations usually happen when a model never knew the answer, not when it forgot something.

cma 424 days ago

I think this is definitely not true of catastrophic forgetting from finetuning. And with other related types of forgetting from model abliteration there are often extreme increases in hallucination.

The InstructGPT paper also showed that RLHF made hallucination worse (though with more user data rejecting common hallucinations, instruction tuning and RLHF may lower the specific hallucinations that users reject).

Some mention of that here: https://huyenchip.com/2023/05/02/rlhf.html#rlhf_and_hallucin...

kadushka 423 days ago

RL might be making hallucinations worse, that’s true. Why do you think RL is causing catastrophic forgetting? Are there factual knowledge benchmarks showing it for o3 or o4-mini?

cma 423 days ago

Just since any continued training tends to cause catastrophic forgetting if the old info isn't regurgitated again.

Not specifically showing catastrophic forgetting, but hallucination for o3:

    >  From the results of this evaluation, o3's hallucination rate is 33 percent, and o4-mini's hallucination rate is 48 percent — almost half of the time. By comparison, o1's hallucination rate is 16 percent, meaning o3 hallucinated about twice as often.

https://mashable.com/article/openai-o3-o4-mini-hallucinate-h...

DeepSeek R1 handles some of this by distilling "factual Q&A" generated by the original V3 model back in to make a new V3. The V3 paper mentions it incorporated an R1 pass too, so the sequence seems to be: V3 base model, RL pass, V3 retrained from a checkpoint with the RL distillation for the final V3 release, then an additional RL pass for the final R1 release.

V3 Paper

> During the post-training stage, we distill the reasoning capability from the DeepSeekR1 series of models [I think that refers to the earlier checkpoint R1 after the first pass below]

R1 Paper:

> To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.

In general with fine-tuning you can avoid catastrophic forgetting by mixing in the original data during later fine-tuning steps, and from this it seems the same is true of the RL phases, but they are also doing some amount of augmentation and selection on the data involved.
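
A minimal sketch of that kind of mixing, under assumed settings: each batch reserves a fixed share of slots for examples replayed from the original data. The function name, the 25% replay fraction, and the batch size are illustrative choices, not values from either paper.

```python
import random


def mixed_batches(new_data, original_data, batch_size=8, replay_fraction=0.25, seed=0):
    """Yield batches of new examples padded with replayed original examples."""
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_fraction)  # slots reserved for old data
    n_new = batch_size - n_replay
    shuffled = list(new_data)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), n_new):
        # original_data must hold at least n_replay items for rng.sample
        batch = shuffled[start:start + n_new] + rng.sample(original_data, n_replay)
        rng.shuffle(batch)
        yield batch


# Toy data: tagged tuples so the mix in each batch is easy to inspect.
new_examples = [("new", i) for i in range(12)]
original_examples = [("orig", i) for i in range(100)]

for batch in mixed_batches(new_examples, original_examples):
    print([kind for kind, _ in batch])
```

The replay fraction is the knob here: too low and the old distribution drifts, too high and the new signal is diluted. The right value is an empirical question this sketch does not answer.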

mountainriver 423 days ago

There has been a post-training phase with RLHF ever since GPT-3.5, though.

It’s nothing new, and it has worked great for a long time. The difference now is RLVR, which, yes, I do suspect is causing models to over-optimize for verifiable tasks and probably lose a lot of nuanced information.

kadushka 423 days ago

Catastrophic forgetting would degrade factual knowledge benchmark results - even more than hallucination benchmarks, right? Do we observe this with o3/o4-mini? If not, your hypothesis is invalidated.
