


	by with 357 days ago
	It’s a widely accepted eval technique, and it’s called “LLM as a judge”.

4 comments

jacquesm 357 days ago

Accepted does not mean correct. It's like using a rubber yardstick as the means to figure out who won the pumpkin growing competition.


ben_w 357 days ago

I'd say it's worse than that: a rubber ruler still has a definite length when not under tension, etc.

This might be more like asking amateur painters to each paint a picture of a different one of the pumpkins, then judging each other's paintings without seeing the actual pumpkin that painting was based on.


jacquesm 357 days ago

Ok, that is indeed better. For a further improvement we should let the previous generation of paintings judge the new one.


kingstnap 357 days ago

It's widely accepted because it's cheap, but LLMs aren't really good judges.

It's supposed to leverage a "generate vs. critique" gap in skill level as a form of self-improvement. It's easier to judge how good food is than to make it.

But here's the thing. When it comes to code review, you need to be effectively as skilled as the person who wrote it. There isn't really a gap.

And then the real clincher is this. LLMs already have a skill gap between their judgement and generation skills as is. The reason is that they have superhuman pattern matching and memorization ability. They can use their memorized patterns as a massive crutch for their actual reasoning skills when generating, but they can't do the same for judgement calls in code review.


sensanaty 357 days ago

Accepted by whom, the people shoving AI down our throats?


magicalhippo 357 days ago

Shouldn't one review the ratings of, say, a random 1% to ensure it's performing as expected?
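
For what it's worth, a spot check like that could be sketched roughly as follows. This is only an illustration under assumptions: the record layout (dicts with `judge` and `human` keys) and the helper names `sample_for_audit` and `agreement_rate` are made up here and are not from any particular eval framework.

```python
import random


def sample_for_audit(ratings, fraction=0.01, seed=0):
    """Pick a uniform random subset of judge ratings for human review.

    `ratings` is any list of records; `fraction` is the share to audit
    (0.01 = the 1% suggested above). At least one item is returned
    whenever the list is non-empty.
    """
    if not ratings:
        return []
    k = max(1, round(len(ratings) * fraction))
    rng = random.Random(seed)  # fixed seed so the audit set is reproducible
    return rng.sample(ratings, k)


def agreement_rate(audited):
    """Fraction of audited records where the human label equals the judge's.

    Each record is assumed (hypothetically) to be a dict with
    "judge" and "human" keys holding comparable labels.
    """
    if not audited:
        raise ValueError("no audited records to compare")
    matches = sum(1 for rec in audited if rec["judge"] == rec["human"])
    return matches / len(audited)
```

You'd call `sample_for_audit` on the full list of judge ratings, have a person fill in the `human` field for the sampled records, then look at `agreement_rate`. Note that 1% of a small run is very few items, so the resulting rate would be a noisy estimate.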
