

by comex 252 days ago

This is a parody but the phenomenon is real.

My uninformed suspicion is that this kind of defensive programming somehow improves performance during RLVR. Perhaps the model sometimes comes up with programs that are buggy enough to emit exceptions, but close enough to correct that they produce the right answer after swallowing the exceptions. So the model learns that swallowing exceptions sometimes improves its reward. It also learns that swallowing exceptions rarely reduces its reward, because if the model does come up with fully correct code, that code usually won’t raise exceptions in the first place (at least not in the test cases it’s being judged on), so adding exception swallowing won’t fail the tests even if it’s theoretically incorrect.

Again, this is pure speculation. Even if I’m right, I’m sure another part of the reason is just that the training set contains a lot of code written by human beginners, who also like to ignore errors.
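A minimal sketch of the dynamic speculated about above, assuming a binary pass/fail reward computed from a fixed set of test cases. Every name here (`reward`, `TEST_CASES`, the three `mean` variants) is hypothetical and chosen only for illustration; real RLVR setups are not described in this thread and will differ.

```python
# Hypothetical sketch: a binary "verifiable reward" that only checks outputs
# on a fixed set of test cases. Nothing here reflects a real training setup.

# Each test case is (positional args, expected result). Note that the empty
# list is NOT covered, which matters for the last variant below.
TEST_CASES = [
    (([1, 2, 3],), 2.0),
    (([10],), 10.0),
]


def reward(candidate, test_cases):
    """Return 1.0 if the candidate passes every test case, else 0.0."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            return 0.0  # an uncaught exception counts as a failed test
    return 1.0


def buggy_mean(values):
    # Off-by-one bug: reads one element past the end and raises IndexError.
    total = 0
    for i in range(len(values) + 1):
        total += values[i]
    return total / len(values)


def buggy_mean_swallowing(values):
    # Same off-by-one bug, but the exception is swallowed, so the function
    # happens to return the right answer anyway.
    total = 0
    for i in range(len(values) + 1):
        try:
            total += values[i]
        except IndexError:
            pass
    return total / len(values)


def correct_mean_swallowing(values):
    # Correct logic plus a needless blanket handler. On the tested inputs the
    # handler never fires, so it costs nothing in reward.
    try:
        return sum(values) / len(values)
    except Exception:
        return 0  # silently wrong for an empty list, but no test checks that
```

With these definitions, `reward(buggy_mean, TEST_CASES)` is `0.0`, while `reward(buggy_mean_swallowing, TEST_CASES)` and `reward(correct_mean_swallowing, TEST_CASES)` are both `1.0`. Swallowing turned a failing program into a passing one in the first case and cost nothing in the second, even though `correct_mean_swallowing([])` quietly returns `0` instead of reporting an error.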

3 comments

rsynnott 252 days ago

The great Verity Stob (unfortunately, in an article which no longer seems to be online, after the Dr. Dobb's Journal website finally went away) referred to this behaviour (by _human_ programmers) as "nailing the corpse in an upright position".


automatic6131 251 days ago

https://97-things-every-x-should-know.gitbooks.io/97-things-...


rsynnott 250 days ago

This is her quoting the original article, but as far as I can see the original article is lost to the mists of time. Might be on archive.org, I suppose.


MakeAJiraTicket 252 days ago

Defensive programming is considered "correct" by the people doing the reinforcing, and is a huge part of the corpus that LLMs are trained on. For example, most Python code doesn't do manual index management, so when a model sees manual index management it is much more likely to freak out and hallucinate a bug. It will randomly promote "silent failure" even when a "silent failure" results in things like infinite loops, because it was trained on a lot of tutorial Python code and "industry standard" gets more reinforcement during training.

These aren't operating on reward functions because there's no internal model to reward. It's word prediction, there's no intelligence.
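One way a "silent failure" can turn into an infinite loop under manual index management is sketched below. This is an illustrative guess at the kind of case meant, not code from the commenter; the function names and the `max_steps` guard are hypothetical, and the guard exists only so the demonstration terminates.

```python
def parse_all(items, max_steps=1000):
    """Parse strings to ints using a manually managed index.

    max_steps is a guard added for this sketch so that the stuck loop is
    reported instead of spinning forever.
    """
    results = []
    i = 0
    steps = 0
    while i < len(items):
        steps += 1
        if steps > max_steps:
            raise RuntimeError(f"no progress: stuck at index {i}")
        try:
            results.append(int(items[i]))
        except ValueError:
            # "Silent failure": skip the bad item. But `continue` also skips
            # the increment below, so the loop retries the same index forever.
            continue
        i += 1
    return results


def parse_all_loud(items):
    """Same job, but a bad item raises with context instead of being hidden."""
    results = []
    for item in items:
        try:
            results.append(int(item))
        except ValueError as exc:
            raise ValueError(f"cannot parse item {item!r}") from exc
    return results
```

`parse_all(["1", "2"])` returns `[1, 2]`, but `parse_all(["1", "x", "3"])` never advances past index 1 and only stops because the guard raises `RuntimeError`. The loud variant fails immediately on `"x"` with a message that names the bad input.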


LeifCarrotson 252 days ago

LLMs do use simple "word prediction" in the pretraining step, just ingesting huge quantities of existing data. But that's not what LLM companies are shipping to end users.

Subsequently, ChatGPT/Claude/Gemini/etc. go through additional training: supervised fine-tuning, then reinforcement learning with reward functions, based either on human feedback (RLHF) or on automatically checkable outcomes (RLVR, 'verifiable rewards').

Whether that fine-tuning and reward function generation give them real "intelligence" is open to interpretation, but it's not 100% plagiarism.


aoeusnth1 251 days ago

You used the word reinforcing, and then asserted there's no reward function. Can you explain how it's possible to perform RL without a reward function, and how the LLM training process maps to that?


MakeAJiraTicket 251 days ago

LLM actions are divorced from that reward function; it's not something they consult or consider. A reward function doesn't make sense in that context.


comex 252 days ago

Reinforcement learning by definition operates on reward functions.


btown 252 days ago

My suspicion is that the training set features a lot of code with “positive sentiment” in text and comments around it… but where does one find code with “negative” sentiment, followed by code that is the “corrected” version of that code? In programs written for technical interview prep, where handling of edge cases beyond realistic production situations is the norm. A model trained to use negative examples in its training set as guidance would gravitate away from examples that skip exception handling.

In this, at least, AI may very well have copied our worst habits of “learning to the test.”
