


	by measurablefunc 147 days ago
	There is no RL for programming languages. Especially ones w/ no significant amount of code.

3 comments

nl 147 days ago

I guess the op was implying that is something fixable fairly easily?

(Which is true - it's easy to prompt your LLM with the language grammar, have it generate code and then RL on that)

Easy in the sense of "it only takes having enough GPUs to RL a coding-capable LLM", anyway.
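A minimal sketch of the sampling step described above, assuming a verifier exists for the target language. Everything here is hypothetical: `sample` stands in for an LLM call, `verify` for a compiler or test harness, and the toy "language" is just balanced parentheses. The policy update itself is not shown.

```python
from itertools import cycle
from typing import Callable, List, Tuple


def collect_rollouts(
    grammar: str,
    task: str,
    sample: Callable[[str], str],    # hypothetical: wraps an LLM call
    verify: Callable[[str], float],  # hypothetical: compiler/test-based reward
    group_size: int = 4,
) -> List[Tuple[str, float]]:
    """Put the grammar in the prompt, sample a group, and score each sample."""
    prompt = f"Grammar of the target language:\n{grammar}\n\nTask: {task}\n"
    completions = [sample(prompt) for _ in range(group_size)]
    return [(c, verify(c)) for c in completions]


def balanced(src: str) -> float:
    """Toy verifier: reward 1.0 only for a balanced string of parentheses."""
    depth = 0
    for ch in src:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return 0.0
        else:
            return 0.0
    return 1.0 if depth == 0 else 0.0


# Stand-in for a model: cycles through fixed candidates, ignoring the prompt.
_candidates = cycle(["()", "(()", "(())", ")("])


def fake_sample(prompt: str) -> str:
    return next(_candidates)


rollouts = collect_rollouts('S -> "" | "(" S ")" S', "emit a program", fake_sample, balanced)
```

The scored rollouts are what an RL step would consume. Whether this works for a language with almost no existing code is exactly what the thread is disputing, and this sketch does not settle it.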

measurablefunc 147 days ago

If you can generate code from the grammar then what exactly are you RLing? The point was to generate code in the first place so what does backpropagation get you here?

nl 147 days ago

Post RL you won't need to put the grammar in the prompt anymore.

measurablefunc 147 days ago

The grammar of this language is no more than a few hundred tokens (thousands at worst) & current LLMs support context windows in the millions of tokens.

nl 147 days ago

Sure.

The point is that your statement about the ability to do RL is wrong.

Additionally, your response to the DeepSeek paper in the other subthread shows profound and deliberate ignorance.

measurablefunc 146 days ago

Theorycrafting is very easy. Not a single person in this thread has shown any code to do what they're suggesting. You have access to the best models & yet you still haven't managed to prompt it to give you the code to prove your point so spare me any further theoretical responses. Either show the code to do exactly what you're saying is possible or admit you lack the relevant understanding to back up your claims.

thorum 147 days ago

Go read the DeepSeek R1 paper

measurablefunc 147 days ago

Why would I do that? If you know something then quote the relevant passage & equation that says you can train code generators w/ RL on a novel language w/ little to no code to train on. More generally, don't ask random people on the internet to do work for you for free.

thorum 147 days ago

Your other comment sounded like you were interested in learning about how AI labs are applying RL to improve programming capability. If so, the DeepSeek R1 paper is a good introduction to the topic (maybe a bit out of date at this point, but very approachable). RL training works fine for low resource languages as long as you have tooling to verify outputs and enough compute to throw at the problem.

measurablefunc 147 days ago

So you should have no problem bringing up the exact passages & equations they use for their policies.

whimsicalism 147 days ago

imo generally not worth it to keep going when you encounter this sort of HN archetype

whimsicalism 147 days ago

well, that’s one way to react to being provided with interesting reading material.

measurablefunc 147 days ago

Bring up the passage that supports your claim. I'll wait.

nl 146 days ago

Not exactly sure what you are looking for here.

That GRPO works?

> Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase

Page 2 of https://arxiv.org/pdf/2402.03300
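For reference, the "baseline from group scores" idea in that quote can be sketched as follows: each sampled answer's reward is compared with the mean of its own group, so no critic model is needed. This is an illustrative sketch, not the paper's code. The function name, the `eps` term, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division finite when every sample got the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four samples for one prompt; two passed the checker and two did not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Samples that beat their group average get a positive advantage and the rest a negative one. If every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.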

That GRPO on code works?

> Similarly, for code competition prompts, a compiler can be utilized to evaluate the model’s responses against a suite of predefined test cases, thereby generating objective feedback on correctness

Page 4 of https://arxiv.org/pdf/2501.12948
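A rough sketch of what "evaluate against a suite of predefined test cases" could look like as a reward function. Assumptions: the Python interpreter stands in for the target language's compiler or runtime, the reward is the fraction of cases passed (the paper does not specify this here), and the names `run_candidate` and `pass_rate_reward` are made up for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Optional, Tuple


def run_candidate(source: str, stdin_text: str, timeout: float = 5.0) -> Optional[str]:
    """Run the candidate in a fresh process; return stdout, or None on failure."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(source)
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],  # stand-in for the real toolchain
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return None
    return proc.stdout if proc.returncode == 0 else None


def pass_rate_reward(source: str, cases: List[Tuple[str, str]]) -> float:
    """Reward = fraction of (stdin, expected stdout) cases the candidate passes."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = 0
    for stdin_text, expected in cases:
        out = run_candidate(source, stdin_text)
        if out is not None and out.strip() == expected.strip():
            passed += 1
    return passed / len(cases)


cases = [("1 2\n", "3"), ("4 5\n", "9")]
reward = pass_rate_reward("print(sum(int(x) for x in input().split()))", cases)
```

Code that fails to run or times out simply scores zero on that case. Swapping in a different language would mean replacing the `subprocess.run` command with that language's compiler and runner. A real setup would also need sandboxing for untrusted model output.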

measurablefunc 146 days ago

None of those are novel domains w/ their own novel syntax & semantic validators, not to mention the dearth of readily available sources of examples for sampling the baselines. So again, where does it say it works for a programming language with nothing but a grammar & a compiler?

whimsicalism 147 days ago

not even wrong

measurablefunc 147 days ago

Exactly.
