| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by energy123 60 days ago

> 93.6% (congrats Anthropic)

But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submission"

0.191 * 0.594 > 1 - 0.936

Does this mean that the audited subset wasn't representative? Or that Anthropic is getting high answers through some shady means?

2 comments

cjsaltlake 60 days ago

I suggest reading the Mythos report's discussion on SWE-bench and contamination. I think it's fairly convincing that you can account for contamination and still trust SWE-bench numbers on models that aren't over-optimized for it.

link

stingraycharles 59 days ago

You can trust that a model that scores 40% vs a model that scores 90% is indeed worse.

You can’t trust it that a model that scores 93% is better at software engineering than a model that scores 90%, because at that point it’s impossible to distinguish between recall and reasoning.

link

dannyw 59 days ago

It’s honestly far better to just ignore SWEBench Verified in 2026. Multiple labs have noted issues with contamination, and achieving high scores require memorisation of what passes the prescriptive verifier; not what is a correct solution.

40% vs 90%? Sure.

70% vs 90%? _Absolutely meaningless_ as you are not measuring coding intelligence but “how well can the model cheat flaws in SWEBench Verified”, the former can certainly be better at coding even assuming no deliberate benchmaxxing / foul play.

link

kator 59 days ago

> models that aren't over-optimized for it.

But how do you know the model was over-optimized for it or just really good?

link

MagicMoonlight 59 days ago

This article says anthropic models can write out the entire benchmark solution set word for word from memory

link

kmdupree 59 days ago

i disagree: https://www.philosophicalhacker.com/post/anthropic-error/

link

defmacr0 59 days ago

I don't understand that methodology in the first place. Does Anthropic even have some kind of somewhat objective definition to measure and judge "memorization"? Is there any evidence that other LLMs are viable tool to determine that?

link

fulafel 59 days ago

there's more details under the Too narrow and too wide tests heading.

It would be interesting to see a deeper investigation, into how the models are dealing with this and whether the successful ones seemed to be trained on the benchmark.

link