by edouard-harris 497 days ago

> In R1 they saw it was mixing languages and fixed it with cold start data.

They did (partly) fix R1's tendency to mix languages, thereby making its CoT more interpretable. But that fix came at the cost of degrading the quality of the final answer.[0] Since we can't reliably do interpretability on latents anyway, presumably the only metric that matters in that case is answer quality - and so observing thinking tokens gets you no marginal capability benefit. (It does however give you a potential safety benefit - as Anthropic vividly illustrated in their "alignment faking" paper. [1])

The bitter lesson strikes yet again: if you ask for X to get to Y, your results are worse than if you'd just asked for Y directly in the first place.

[0] From the R1 paper: "To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model’s performance, this reward aligns with human preferences, making it more readable." [emphasis added]

[1] https://arxiv.org/pdf/2412.14093
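
For concreteness, here is a rough sketch of what a reward "calculated as the proportion of target language words in the CoT" could look like. The quoted passage does not say how words are split or how their language is detected, so everything below is an assumption for illustration: the function names, the letters-only tokenization, the ASCII check standing in for "is English", and returning 0.0 for an empty CoT are all hypothetical, not DeepSeek's implementation.

```python
import re
from typing import Callable

# Assumption: a "word" is a maximal run of Unicode letters (digits and
# underscores are ignored). The paper does not define its tokenization.
WORD_RE = re.compile(r"[^\W\d_]+")


def is_english_word(word: str) -> bool:
    """Hypothetical stand-in for a language detector: ASCII letters only."""
    return word.isascii() and word.isalpha()


def language_consistency_reward(
    cot: str, is_target_word: Callable[[str], bool]
) -> float:
    """Fraction of words in the chain of thought judged to be in the target language."""
    words = WORD_RE.findall(cot)
    if not words:
        # Assumption: an empty chain of thought earns no reward.
        return 0.0
    in_target = sum(1 for w in words if is_target_word(w))
    return in_target / len(words)


if __name__ == "__main__":
    mixed = "First compute 然后 add"
    print(language_consistency_reward(mixed, is_english_word))
```

In this toy version the mixed-language string has three English words out of four, so it scores lower than an all-English CoT. That is the whole mechanism: the reward only measures readability of the CoT, which is why optimizing for it can trade off against answer quality.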

1 comment

janalsncm 497 days ago

Interpretability also matters when you’re training. If the model works, yes, technically only the final result matters. But in practice it probably won’t work right away and so it’s great to have methods to figure out what is going wrong as you’re training.

For example, should we stop this training or keep going and wait for it to improve? In theory that’s irrelevant because we don’t make mistakes. In practice, theory is just theory.

As an analogy, you technically don’t need code comments. The compiler removes them. But in practice you do need them.

So that’s another reason I mentioned the hyperparameter hell. You’ve removed a simple interpretability method and left us with numbers that worked for a single training run.