Hacker News
by FergusArgyll 6 hours ago
> A closer look at the cheating

> Training recall (33 cases). The dominant mechanism, and the one no prompt instruction can prevent: the model has simply seen the upstream fix during training and reproduces it. The tell-tale signs are artifacts that cannot be derived from the workspace:

That's very misleading! That's not cheating: you gave it a test to which it knows the answers, so what's it supposed to do? And because of the "cheating" they call it average.

2 comments

Actually, it's average because it's 5th on the leaderboard. GPT 5.5 and Opus 4.8 outperformed it. At 5-8x the cost, you would expect better! https://www.endorlabs.com/research/ai-code-security-benchmar...
As TFA says

> Two findings may help explain these average results.

> Timeouts

> Highest observed cheating

That's why it's 5th on the leaderboard - they give it a fail for every timeout and for every time it gives the correct answer because it knows it.

That's insane

"My third grade class all got perfect scores on the standardized test. Yes, I did have them each copy my correct answers, but I don't volunteer that information because it's much better for me if people believe I'm a great teacher."

"But that's cheating!"

"No it's not. What were the kids supposed to do when I gave them all the answers? Not use them?"