| HN Mirror



	by qsort 357 days ago
	No, they aren't. Most benchmarks use ground truth, not evaluation by another LLM. Using another LLM as a verifier, aside from the obvious "quis custodiet ipsos custodes", opens an entire can of worms, such as possible systematic biases in the evaluation. This is not in and of itself disqualifying, but it should be addressed, and the article doesn't mention it at all.

2 comments

charlieyu1 357 days ago

Even the maths benchmarks only checked the final numerical answer against ground truth, which means an LLM can output a lot of nonsense and still pass by guessing the correct answer.
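To make the point concrete, here is a minimal sketch of an answer-only grader in Python. It is not taken from any real benchmark: the function names, the "last number in the response" extraction rule, and the sample responses are all assumptions chosen for illustration.

```python
import re


def extract_final_number(response: str):
    """Return the last number appearing in the response, or None.

    Hypothetical extraction rule: real benchmarks use their own formats.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(matches[-1]) if matches else None


def answer_only_grade(response: str, expected: float, tol: float = 1e-9) -> bool:
    """Pass if the final number matches the ground truth; the reasoning is never read."""
    got = extract_final_number(response)
    return got is not None and abs(got - expected) <= tol


# Made-up responses to the question "What is 12 squared?"
sound = "12 * 12 = 144, so the answer is 144"
nonsense = "Squares are always prime, therefore the answer is 144"
wrong = "12 * 12 = 124"

for text in (sound, nonsense, wrong):
    print(answer_only_grade(text, 144), "<-", text)
```

Under these assumptions the grader cannot tell the sound derivation from the nonsensical one, because both end in the same number.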


sigmoid10 357 days ago

Ground truth evaluation is not that simple unless you are doing multiple-choice-style tests or something similar, where the correctness of an answer can be determined by a simple process. Open-ended natural language tasks like this one are incredibly difficult to evaluate, and using an LLM as judge is not just the current standard, it is basically the only way to do it economically at scale.


qsort 357 days ago

The original comment was this:

> So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.

The comment I replied to was:

> That's how 99% of 'LLM benchmark numbers' circulating on the internet work.

And that's just false. SWE-bench Verified isn't like this. Aider Polyglot isn't like this. SWE-Lancer Diamond isn't like this. The new internal benchmarks used by OpenAI in GPT-5's model card aren't like this.

Maybe this benchmark is a special snowflake and needs LLM-as-a-judge, but this doesn't invalidate the original concern: setting up a benchmark this way runs into a series of problems and is prone to showing performance differences that might not be there with a different setup. Benchmarks are already hard to trust, and I'm not sure how this one is any more indicative than the rest.


sigmoid10 354 days ago

Benchmarks that execute code are to some degree the only thing where you can automate testing at scale without humans in the loop, but even that has its caveats [1]. Regardless, when your output is natural language text (as in this case), there is simply no viable alternative for measuring accuracy economically. There is frankly no argument to be had here, because this is simply not achievable with current technology.

[1] https://openai.com/index/introducing-swe-bench-verified/
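For readers unfamiliar with execution-based grading, a minimal Python sketch follows. It is not how SWE-bench or any other real harness works: `passes_tests`, the toy `add` task, and the test cases are hypothetical, and a real harness would run candidates in a sandbox rather than calling `exec` directly.

```python
def passes_tests(candidate_source: str, func_name: str, cases) -> bool:
    """Execute candidate code and compare its outputs with known expected values.

    Illustration only: exec() on untrusted model output is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


# Two made-up model outputs for the task "implement add(a, b)".
good_patch = "def add(a, b):\n    return a + b\n"
bad_patch = "def add(a, b):\n    return a - b\n"

strong_cases = [((1, 2), 3), ((0, 0), 0)]
weak_cases = [((0, 0), 0)]  # too weak to tell addition from subtraction

print(passes_tests(good_patch, "add", strong_cases))
print(passes_tests(bad_patch, "add", strong_cases))
print(passes_tests(bad_patch, "add", weak_cases))
```

The weak test list shows one caveat of this kind: the verdict is only as trustworthy as the tests, so an incorrect solution passes whenever the tests fail to exercise the bug.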
