| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by jstanley 469 days ago
	I keep seeing people mention "illegible reasoning" but I'd be fascinated to see an example of what it actually looks like. Do you have any examples? Apparently DeepSeek-R1 can switch between English, Chinese, and gibberish, and even the gibberish helps it think! That's fascinating, but all I can find is people saying it, nobody showing it.

2 comments

Imnimo 469 days ago

Here's an example of language switching:

https://gr.inc/question/although-a-few-years-ago-the-fundame...

In the dropdown set to DeepSeek-R1, switch to the LIMO model (which apparently has a high frequency of language switching).

I'm not sure about examples of gibberish or totally illegible reasoning. My guess is that since R1-Zero still had the KL penalty, it should all be somewhat legible - the KL penalty encourages the model to not move too far from what the base model would say in any given context.

link

jstanley 469 days ago

Thanks, that's cool to see. I hadn't seen this site before but browsing around I also found this example: https://gr.inc/question/why-does-the-professor-say-this-good... - also with LIMO.

link

pizza 469 days ago

Seems like if you want to stay in the same language, you could just add a verifiable rewards term for that w/o having to fully load up on the baggage of a base model KL penalty.

link

kcorbitt 469 days ago

Yep. And tbh you probably don't even have to do this; the R1 paper found that just running SFT the base model with a relatively small number of monolingual reasoning traces was enough for it to get the idea and iirc they didn't even bother selecting for language specifically in the RL training looop itself.

link

NitpickLawyer 469 days ago

Don't have examples handy, but I did a round of grpo on a 7b model and it did indeed start to switch between english, coreean and chinese, but the reward was steadily increasing. RL doesn't care what the middle tokens are, as long as the end result gets the carrot.

I think there's still a lot to learn about reward functions (saw a team work w/ just correct output, and nothing else), if you should reward partial success (i.e. code compiles / math outputs a result) or just the final thing (i.e. test cases pass / correct answer) and so on.

Not to mention how to get downstream signals from e2e tasks (i.e. if an "agent" navigates to the correct webpage and finds a "cookie" or something, figure out how to reward all the intermediary steps out of that single binary signal).

And there's a lot to learn in using grammars & stuff w/ RL as well. The problem there is that the libraries are pretty wonky atm, some things work, some things need work, and RL in itself is pretty slow due to having to generate, update the model and generate again.

link