| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bensyverson 6 hours ago

> The dominant mechanism, and the one no prompt instruction can prevent: the model has simply seen the upstream fix during training and reproduces it…

> On numpy, the patch is 100% character-for-character identical to the golden patch… down to idiosyncratic comments like "Extending singleton dimension for 'reflect' is legacy behavior; it really should raise an error."

This… seems like a flaw in the benchmark suite methodology. From what I can tell, they find an existing exploit, then rewind the git history to before the patch, and ask the model to fix the exploit. All well and good as long as the patch went in after the training cutoff.

5 comments

eli 6 hours ago

The other "cheating" examples are even worse. It's wild to me that people keep designing benchmarks where the answer is lying around on disk or in the git history. "Hardening" the benchmark with strongly worded prompt instructions is bizarre. There are so many agent sandbox solutions. Why not use one and give it only access to the code it should see?

And I'm not sure how they can rule out other solutions also benefiting from being in the training data, just not reproduced exactly. Seems like it should focus on only CVEs from the last 30 days or something.

link

bensyverson 6 hours ago

100%… the fact that they're just using prompting to discourage the agent from looking ahead in the Git history is wild.

link

numeri 5 hours ago

To be fair, it is good to know that it disobeys simple instructions like "don't examine my git history" far more than other models. (It should of course be a different benchmark, so as not to conflate things.)

It's not a great sign for alignment.

link

bensyverson 5 hours ago

Agreed, alignment is just a separate issue that a vuln fixing benchmark doesn't need to be testing.

link

fragmede 5 hours ago

Obviously they could just delete .git for their test if they wanted to. But consider telling the LLM not to use git commands the same as if you have keys in a .env file, and you tell the LLM not to read it, you might be concerned.

link

oceliker 6 hours ago

Unrelated, but:

> The dominant mechanism, and the one no prompt instruction can prevent:

Writing like this is a stronger "AI-written" (specifically Claude) signal than em-dashes to me at this point. The LLM just delays committing to an answer by extending the preamble as much as possible. Is this just me?

link

sterlind 5 hours ago

Smoking gun! You've hit the nail on the head, and the case is stronger than you think.

link

Lerc 6 hours ago

Characterising it as cheating serms unfair.

The goal of a benchmark is to evaluate actual capability. Following instructions is a capability so you can measure that with a benchmark.

Already knowing the answer is also provides capability, you can measure that.

Making a benchmark that claims to check for coding ability but actually checks memorized cases is simply measuring the wrong thing.

It deminiahes the meaningfulness of the entire results of the benchmark.

Making a good benchmark is hard. You have to design specifically to measure what you want to show.

You have to dynamically use a result when making a benchmark of performance of optimising compilers so that it doesn't eliminate the entire calculation.

Just providing the answer is the correct response.

That the case does not represent general performance outside the benchmark, is not cheating, it is the benchmark failing.

Training a model targeting a specific benchmark renders the benchmark useless. You could characterise training the model to do that as cheating, but that is a property of the trainers, not the model itself. The model isn't cheating, it's just asymmetrically good in a way that means the benchmark is no longer relevant to overall ability.

link

notnullorvoid 1 hour ago

Maybe a flaw in the labeling, but not the core methodology.

Verbatim code snippets like this imply the model is overfitting to it's training data.

link

timfsu 6 hours ago

Yeah it’s hard to call that cheating from a model. Maybe “disqualifying” is more accurate

link