by qsort 310 days ago
The original comment was this:

> So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.

The comment I replied to was:

> That's how 99% of 'LLM benchmark numbers' circulating on the internet work.

And that's just false. SWE-bench Verified isn't like this. Aider Polyglot isn't like this. SWE-Lancer Diamond isn't like this. The new internal benchmarks used by OpenAI in GPT-5's model card aren't like this.

Maybe this benchmark is a special snowflake and needs LLM-as-a-judge, but that doesn't invalidate the original concern: setting up a benchmark this way runs into a series of problems and is prone to showing performance differences that might not be there with a different setup. Benchmarks are already hard to trust, and I'm not sure how this one is any more indicative than the rest.

1 comment

Benchmarks that execute code are, to some degree, the only kind where you can automate testing at scale without humans in the loop, and even those have their caveats [1]. Regardless, when your output is natural-language text (as it is in this case), there is no viable alternative to LLM-as-a-judge for measuring accuracy economically. There is frankly no argument to be had here, because automated ground-truth grading of free-form text is simply not achievable with current technology.

[1] https://openai.com/index/introducing-swe-bench-verified/
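
To make the distinction concrete, here is a minimal sketch of what "a benchmark that executes code" means: the score comes from running the candidate against fixed test cases, so no second model has to judge anything. The names (`grade_by_execution`, `pass_rate`, the toy `add` task) are made up for illustration and are not taken from SWE-bench or any other real harness, which would also sandbox the execution.

```python
def grade_by_execution(candidate_src: str, func_name: str,
                       cases: list[tuple[tuple, object]]) -> bool:
    """Return True only if the candidate passes every test case.

    Ground truth is the expected value in each case, not a judge's opinion.
    """
    namespace: dict = {}
    try:
        # Illustration only: a real harness would isolate untrusted code.
        exec(candidate_src, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions and crashes all count as failures.
        return False


def pass_rate(candidates: list[str], func_name: str,
              cases: list[tuple[tuple, object]]) -> float:
    """Fraction of candidate solutions that pass all test cases."""
    results = [grade_by_execution(src, func_name, cases) for src in candidates]
    return sum(results) / len(results)


# Hypothetical task: implement add(a, b).
CASES = [((2, 3), 5), ((-1, 1), 0), ((0, 0), 0)]
GOOD = "def add(a, b):\n    return a + b\n"
BAD = "def add(a, b):\n    return a - b\n"
BROKEN = "def add(a, b)\n    return a + b\n"  # missing colon, will not compile

if __name__ == "__main__":
    print(pass_rate([GOOD, BAD, BROKEN], "add", CASES))
```

There is no equivalent of `func(*args) == expected` for a free-text code review, which is why that kind of output ends up being graded by humans or by another model.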