by alextheparrot 383 days ago

LLMs evaluating LLM outputs really isn’t that dire…

Discriminating good answers is easier than generating them. Good evaluations write test sets for the discriminators to show when this is or isn’t true. Evaluating the outputs as the user might see them is more representative than having your generator do multiple tasks (e.g. solve a math query and format the output as a multiple choice answer).

Also, human labels are good but have problems of their own, it isn’t like by using a “different intelligence architecture” we elide all the possible errors. Good instructions to the evaluation model often translate directly to better human results, showing a correlation between these two sources of sampling intelligence.

5 comments

majormajor 383 days ago

> Discriminating good answers is easier than generating them.

I don't think this is true for many fields - especially outside of math/programming. Let's say the task is "find the ten most promising energy startups in Europe." (This is essentially the sort of work I see people frequently talk about using research modes of models for here or on LinkedIn.)

In ye olden days pre-LLM you'd be able to easily filter out a bunch of bad answers from lazy humans since they'd be short, contain no detail, have a bunch of typos, formatting inconsistencies from copy-paste, etc. You can't do that for LLM output.

So unless you're a domain expert on European energy startups you can't check for a good answer without doing a LOT of homework. And if you're using a model that usually only looks at, say, the top two pages of Google results to try to figure this out, how is the validator going to do better than the original generator?

And what about when the top two pages of Google results start turning into model-generated blogspam?

If your benchmark can't evaluate prospective real-world tasks like this, it's of limited use.

A larger issue is that once your benchmark, which used this task as a criterion based on an expert's knowledge, is published, anyone making an AI agent is strongly incentivized (intentionally or not!) to train specifically on this answer without necessarily getting better at the fundamental steps in the task.

IMO you can never use an AI agent benchmark that is published on the internet more than once.

jgraettinger1 383 days ago

> You can't do that for LLM output.

That's true if you're just evaluating the final answer. However, wouldn't you evaluate the context -- including internal tokens -- built by the LLM under test?

In essence, the evaluator's job isn't to do separate fact-finding, but to evaluate whether the under-test LLM made good decisions given the facts at hand.

majormajor 383 days ago

I would if I were the developer, but if I'm the user being sold the product, or a third-party benchmarker, I don't think I'd have full access to that if most of it is happening in the vendor's internal services.

alextheparrot 383 days ago

> Good evaluations write test sets for the discriminators to show when this is or isn’t true.

If they can’t write an evaluation for the discriminator I agree. All the input data issues you highlight also apply to generators.

brookst 383 days ago

> IMO you can never use an AI agent benchmark that is published on the internet more than once.

This is a long-solved problem far predating AI.

You do it by releasing 90% of the benchmark publicly and holding back 10% for yourself or closely trusted partners.

Then benchmark performance can be independently evaluated to determine if performance on the 10% holdback matches the 90% public.
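
The holdback check described above can be sketched in a few lines. This is a minimal illustration, not a standard tool: `split_benchmark`, `accuracy`, and `contamination_gap` are hypothetical names, a "model" is assumed to be any callable from question to answer, and exact-match scoring is assumed.

```python
import random

def split_benchmark(items, holdout_fraction=0.1, seed=0):
    """Shuffle (question, answer) pairs and return (public, holdout) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def accuracy(model, items):
    """Fraction of items the model answers exactly right."""
    return sum(model(question) == answer for question, answer in items) / len(items)

def contamination_gap(model, public, holdout):
    """Public accuracy minus holdout accuracy; a large positive gap is suspicious."""
    return accuracy(model, public) - accuracy(model, holdout)
```

A model that merely memorized the public split would score well on it and much lower on the holdback, giving a large gap; a model that actually solves the task should show a gap near zero.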

tempfile 383 days ago

> Discriminating good answers is easier than generating them.

This is actually very wrong. Consider for instance the fact that people who grade your tests in school are typically more talented, capable, and trained than the people taking the test. This is true even when an answer key exists.

> Also, human labels are good but have problems of their own,

Granted, but...

> it isn’t like by using a “different intelligence architecture” we elide all the possible errors

Nobody is claiming this. We elide the specific, obvious problem that using a system to test itself gives you no reliable information. You need a control.

alextheparrot 383 days ago

It isn’t actually very wrong. Your example is tangential as graders in school have multiple roles — teaching the content and grading. That’s an implementation detail, not a counter to the premise.

I don’t think we should assume answering a test would be easy for a Scantron machine just because it is very good at grading them, either.

tempfile 380 days ago

No. Graders having multiple roles is actually the implementation detail, since they're people, and they can't spend all day grading work. Scanning machines don't really grade work either, but I am happy to rely on them for checking an answer matches a scheme verbatim. I'm not sure why you mention scanners answering tests either, since my original comment doesn't imply that.

There is no evidence that an LLM can reliably evaluate the semantic content of a sentence, even in cases where we all agree that the semantic content exists. The thread we are participating in demonstrates a particularly egregious failure, and there is no good reason to think that more subtle failures would not exist even if we patched this one. Even if they were reliable, you can't evaluate a system with itself - that is basic science.

rf15 383 days ago

Trading control for convenience has always been the tradeoff in the recent AI hype cycle and the reason why so many people like to use ChatGPT.

tempfile 380 days ago

Not "control", "a control". As in a control group, for a study.

diggan 383 days ago

> Discriminating good answers is easier than generating them.

Lots of other good replies to this specific part, but also: lots of developers feel that reviewing code is harder than writing it (something I'm personally not sure I agree with). I've seen that sentiment shared here on HN a lot, and it would directly contradict that particular idea.

alextheparrot 383 days ago

I wish the other replies and this would engage with the sentence right after it indicating that you should test this premise empirically.

suddenlybananas 383 days ago

What's 45+8? Is it 63?

alextheparrot 383 days ago

If this sort of error isn’t acceptable, it should be part of an evaluation set for your discriminator.

Fundamentally I’m not disagreeing with the article, but I also think most people who care take the above approach: if you do care, you read samples, find the issues, and patch them to hill climb better.
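
An evaluation set for a discriminator, as suggested here, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the labeled cases, the `evaluate_discriminator` helper, and the two stand-in judges (plain functions, where a real setup would call an LLM).

```python
import re

# Hypothetical labeled cases: (question, candidate answer, is the candidate correct?)
CASES = [
    ("What's 45+8?", "63", False),
    ("What's 45+8?", "53", True),
    ("What's 2+2?", "4", True),
    ("What's 10+5?", "16", False),
]

def evaluate_discriminator(judge, cases):
    """Return (accuracy, failures) for a judge(question, answer) -> bool."""
    failures = [case for case in cases if judge(case[0], case[1]) != case[2]]
    return 1 - len(failures) / len(cases), failures

def gullible_judge(question, answer):
    """Stand-in judge that accepts anything that looks like a number."""
    return answer.isdigit()

def arithmetic_judge(question, answer):
    """Stand-in judge that actually recomputes the sum."""
    a, b = re.search(r"(\d+)\+(\d+)", question).groups()
    return int(a) + int(b) == int(answer)
```

The returned failures are the samples one would read to find issues; adding each newly discovered failure mode to `CASES` keeps a later judge from regressing on it.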

e1g 383 days ago

Agree, current "thinking" models are effectively "re-run this question N times, and determine the best answer", and this LLM-evaluating-LLM loop demonstrably leads to higher quality answers against objective metrics (in math, etc).

brookst 383 days ago

That’s… not how thinking models work. They tend to be iterative and serial, not parallel and then pick-one.

e1g 383 days ago

Parallel test time compute is exactly what SOTA models do, including Claude 4 Opus extended, o3 Pro, Grok 4 Heavy, and Gemini 2.5 Pro.
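
For reference, the "re-run this question N times, and determine the best answer" pattern under discussion can be sketched as majority voting over independent samples. This only illustrates that one idea and makes no claim about how any of the named models work internally; `best_of_n` and `stub_sample` are hypothetical, and the stub replays canned answers where a real system would call a model.

```python
from collections import Counter
from itertools import cycle

def best_of_n(sample, question, n=5):
    """Draw n candidate answers and return (most common answer, its vote share).

    Ties go to the answer that was sampled first.
    """
    candidates = [sample(question) for _ in range(n)]
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer, votes / n

# Hypothetical generator stub: replays a fixed sequence of answers.
_canned = cycle([53, 63, 53, 53, 63])

def stub_sample(question):
    return next(_canned)
```

Majority voting is the simplest selection rule; swapping it for a separate scoring model would give the LLM-evaluating-LLM loop mentioned earlier in the thread, at which point the quality of that scorer becomes the thing to test.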