by comex 313 days ago
> Each model’s responses are ranked by a high-performing judge model — typically OpenAI’s o3 — which compares outputs for quality, relevance, and clarity. These rankings are then aggregated to produce a performance score.

So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.


That's how 99% of 'LLM benchmark numbers' circulating on the internet work.
No, it isn't. Most benchmarks use ground truth, not evaluation by another LLM. Using another LLM as verifier, aside from the obvious "quis custodiet ipsos custodes", opens an entire can of worms, such as the possibility of systematic biases in the evaluation. This is not in and of itself disqualifying, but it should be addressed, and the article doesn't even mention it.
Even the benchmarks for maths only checked numerical answers against ground truth, which means the LLM can output a lot of nonsense and still guess the correct answer to pass.
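To make that concrete, here is a minimal sketch of a final-answer-only grader. The function name `grade_final_answer` and both sample responses are made up for illustration; real math benchmarks differ in how they extract and normalize answers.

```python
import re

def grade_final_answer(response: str, expected: str) -> bool:
    """Hypothetical grader: compares only the last number in the response."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    # No number at all counts as a wrong answer.
    return bool(numbers) and float(numbers[-1]) == float(expected)

# Two invented responses to "what is 17 + 25?"
sound = "17 + 25 = 42, so the answer is 42"
nonsense = "Since 17 is prime, multiply by the moon phase. The answer is 42"

print(grade_final_answer(sound, "42"))
print(grade_final_answer(nonsense, "42"))
```

Both responses pass, because the reasoning is never inspected, only the final number.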
Ground truth evaluation is not that simple unless you are doing multiple-choice-style tests or something similar where the correctness of an answer can be determined by a simple process. Open ended natural language tasks like this one are incredibly difficult to evaluate and using LLMs as judge is not just the current standard, it is basically the only way to do it at scale economically.
The original comment was this:

> So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.

The comment I replied to was:

> That's how 99% of 'LLM benchmark numbers' circulating on the internet work.

And that's just false. SWE-Bench verified isn't like this. Aider Polyglot isn't like this. SWE-Lancer Diamond isn't like this. The new internal benchmarks used by OpenAI in GPT-5's model card aren't like this.

Maybe this benchmark is a special snowflake and needs LLM-as-a-judge, but this doesn't invalidate the original concern: setting up a benchmark this way runs into a series of problems and is prone to show performance differences that might not be there with a different setup. Benchmarks are already hard to trust, and I'm not sure how this one is any more indicative than the rest.

Benchmarks that execute code are to some degree the only thing where you can automate testing at scale without humans in the loop, but even that has its caveats [1]. Regardless, when your output is natural language text (as it is in this case), there is simply no viable alternative for measuring accuracy economically. There is frankly no argument to be had here, because this is simply not achievable with current technology.

[1] https://openai.com/index/introducing-swe-bench-verified/

Also, using an OpenAI model to judge the performance of an OpenAI model seems prone to all kinds of biases.
Am I missing something? If LLM-1 is supposed to judge LLM-2, doesn't LLM-1 have to be better than LLM-2? If LLM-1 is only 40% as good at coding as LLM-2, why would you trust the LLM with the lesser knowledge?
At the heart of the P vs NP problem lies the observation that solution verification seems to be much easier than solution generation. Whether that applies in this context is another question, but I think it is not unreasonable to assume that the judge can be less powerful than the performer.

Or in other words, I don't need to be a chef myself to decide if a meal is good or not.

That really doesn't hold for all problems. You can imagine any number of problems where a valid solution is easier, complexity wise, to generate than it is to validate. A trivial example is semiprime factorization. Easy to generate any semiprime, hard to factor.
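A rough sketch of that asymmetry, using small primes and plain trial division. The helper names are made up, and trial division is far from the best known factoring method; it is only here to show that generation is one multiplication while recovering the factors is a search.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for the small numbers used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def make_semiprime(p: int, q: int) -> int:
    """Generation: a single multiplication once two primes are known."""
    assert is_prime(p) and is_prime(q)
    return p * q

def factor_semiprime(n: int) -> tuple[int, int]:
    """Recover the factors by trial division; work grows with the smaller factor."""
    i = 2
    while i * i <= n:
        if n % i == 0:
            return i, n // i
        i += 1
    raise ValueError("n has no nontrivial factor")

n = make_semiprime(10007, 10009)
print(n, factor_semiprime(n))
```

With five-digit primes the search is instant, but it takes about ten thousand loop iterations against one multiplication, and the gap widens quickly as the primes grow.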
Sure, it was never my intention to make it seem like a general statement, just highlighting that there is a large class of problems for which it is true.

As you point out, there are many problems that are in higher complexity classes than NP.

> That really doesn't hold for all problems.

But it does hold for this problem.

How so? Asking LLMs to solve a problem can be a problem of any form. For example I just asked this.

Can you give me a very large semiprime?

And claude opus answered:

Here's a very large semiprime:

N = 29927402397991286489627837734179186385188296382227646249397073654051914085318503794952624411151858464246403027505634195232053330357484129331920822220662818816547063469215394303721576869467659309978113411955550111870966028627418736664

This is a over 200-digit semiprime. Factoring semiprimes of this size is computationally intensive, which is why they form the basis of RSA encryption security.

---

Verifying whether this answer is correct is very hard, much harder than generating it.

Problems of this form come up very often, and not only in formal mathematics. Some magic number in the code that you need to reverse engineer to tell it's correct. Some library whose documentation you don't have, though it was available when the code was written. Hidden intentions or even requirements that are not clear from the code itself. If a weaker LLM is validating a stronger LLM, the weaker LLM will simply not grasp the subtleties the stronger LLM put in its answer. In fact it's a pretty common statement that writing code is easier than reading it, which is precisely about generation vs validation.

Pretty sure they know that, their point still stands
It's a bit different for reasoning LLMs - they operate in a feedback loop, measuring the quality of the solution and iterating on it until either the quality meets a desired threshold, or all reasoning effort is expended.

This can correct for generation errors, but cannot correct for quality measurement errors, so the question is valid.

It's usually easier to create a false statement than to check whether it's false.
At least use something like Zen MCP’s Consensus tool to gain a consensus around a large variety of models.
Exactly, they should at least compare against the best models from other vendors as judges, ideally verified by humans, ground truth, or tests.
Yes, especially as models are known to have a preference towards outputs of models in the same family. I suspect this leaderboard would change dramatically with different models as the judge.
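One generic way to check how much the judge matters is to rank the same models under several judges and compare. The sketch below is not how Zen MCP or this leaderboard works; every judge name, model name, and score is invented placeholder data.

```python
from statistics import mean

# Placeholder scores: judge -> {model: score}. All names and numbers are invented.
scores = {
    "judge_x": {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5},
    "judge_y": {"model_a": 0.6, "model_b": 0.8, "model_c": 0.5},
    "judge_z": {"model_a": 0.7, "model_b": 0.9, "model_c": 0.4},
}

def ranking(judge_scores: dict[str, float]) -> list[str]:
    """Models sorted from best to worst under one judge."""
    return sorted(judge_scores, key=judge_scores.get, reverse=True)

def mean_rank(all_scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average position (1 = best) of each model across all judges."""
    ranks: dict[str, list[int]] = {}
    for judge_scores in all_scores.values():
        for pos, model in enumerate(ranking(judge_scores), start=1):
            ranks.setdefault(model, []).append(pos)
    return {model: mean(positions) for model, positions in ranks.items()}

for judge, judge_scores in scores.items():
    print(judge, ranking(judge_scores))
print(mean_rank(scores))
```

If the per-judge orderings disagree, as they do in this toy data, a leaderboard built on a single judge says as much about the judge as about the models.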
I don't care about either method. The ground truth should be what a human would do, not what a model does.
There may be different or better solutions for almost all of those kinds of tasks. I wouldn’t be surprised if the optimal answer to some of them were to refuse or defer the ask: refactor first, then solve it properly.
That response is quite in line with the typical human based PR response on a first draft.

There is a possibility that machine based PR reviews are better: for instance because they are not prejudiced based on who is the initiator of the PR and because they don't take other environmental factors into account. You'd expect a machine to be more neutral, so on that front the machine should and possibly could score better. But until the models consistently outperform the humans in impartially scored quality vs a baseline of human results it is the humans that should call this, not the machines.

I wouldn't necessarily expect a machine to be more neutral. Machines can easily be biased too.
On something like a PR review I would. But on anything that would involve private information such as the background, gender, photographs and/or video as well as other writings by the subject I think you'd be right.

It's just that it is fairly trivial to present a PR to a machine in such a way that it can only comment on the differences in the code. I would find it surprising if that somehow led to a bias about the author. Can you give an example of how you think that would creep into such an interaction?

They are different models already but yes, I already let ChatGPT judge Claude's work for the same reason.
It’s a widely accepted eval technique and it’s called “llm as a judge”
Accepted does not mean correct. It's like using a rubber yardstick as the means to figure out who won the pumpkin growing competition.
I'd say it's worse than that, a rubber ruler still has a definite length when not under tension etc.

This might be more like asking amateur painters to each paint a picture of a different one of the pumpkins, then judging each other's paintings without seeing the actual pumpkin that painting was based on.

Ok, that is indeed better. For a further improvement we should let the previous generation of paintings judge the new one.
It's widely accepted because it's cheap, but LLMs aren't really good judges.

It's supposed to leverage a "generate vs. critique" gap in skill level as a form of self-improvement. It's easier to judge how good food is vs. make it.

But here's the thing. When it comes to code review, you need to be effectively as skilled as the person who wrote it. There isn't really a gap.

And then the real clincher is this. LLMs naturally have a skill gap between their judgement and generation skills as is. The reason is that they have superhuman pattern matching and memorization ability. They can use their memorized patterns as a massive crutch for their actual reasoning skills, but they can't do the same for judgement calls in code review.

Accepted by whom, the people shoving AI down our throats?
Shouldn't one review the ratings of say a random 1% to ensure it's performing as expected?
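A minimal sketch of what such a spot check could look like, assuming each judge verdict is stored as a record with an `id` and a `winner` field. Those field names, the function names, and the data are all hypothetical.

```python
import random

def sample_for_audit(judgments: list[dict], fraction: float = 0.01, seed: int = 0) -> list[dict]:
    """Pick a reproducible random subset of judge verdicts for human review."""
    k = max(1, round(len(judgments) * fraction))
    return random.Random(seed).sample(judgments, k)

def agreement_rate(audited: list[dict], human_labels: dict[int, str]) -> float:
    """Fraction of audited items where the human reviewer agrees with the judge."""
    matches = sum(1 for j in audited if human_labels[j["id"]] == j["winner"])
    return matches / len(audited)

# Placeholder data: 1000 fake judge verdicts, not real benchmark results.
judgments = [{"id": i, "winner": "A" if i % 3 else "B"} for i in range(1000)]
audit = sample_for_audit(judgments)
print(len(audit))
```

A low agreement rate on the audited sample would be a signal that the judge's rankings need a closer look before anyone trusts the aggregate score.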
> Hard to tell what to make of that.

It's not hard. You are visiting a website with an .ai domain. You already know what the conclusions will be.

Why is it hard to ignore an attempt to assess reality that is not grounded in reality?
That's an extremely dense question :) (Not pejoratively, but conceptually dense).

I had some fun trying to answer it without fixating on whether or not the premise is true, for argument's sake.

My answer is:

I would think "attempting to assess reality that is not grounded in reality" is hard to ignore due to a combination of "it's what is available," being easy to understand, and seeming useful (decoupled from whether it's really so). As a result, it's hard to ignore because it's what is mostly available to us for consumption and is easy to make "consumable."

I think there is a LARGE overlap in this topic with my pet peeve and hatred of mock tests in development. They are not completely useless, but their obvious flaws and vulnerabilities seem to me to be in the same area: "Not grounded in reality."

Said another way: Because it's what's easy to make, and thus there is a lot of it, creating a positive feedback loop of mere-exposure effect. Then it becomes hard to ignore because it's what's shoved in our face.

It's almost too on the nose to be satire, yet here we are.
It undermines the private benchmark approach if the evaluation is done that way.