Hacker News
310 days ago
It’s a widely accepted eval technique, and it’s called “LLM as a judge”.
4 comments

Accepted does not mean correct. It's like using a rubber yardstick to figure out who won the pumpkin-growing competition.
I'd say it's worse than that: a rubber ruler still has a definite length when it isn't under tension.

This might be more like asking amateur painters to each paint a picture of a different one of the pumpkins, then having them judge each other's paintings without seeing the actual pumpkin each painting was based on.

Ok, that is indeed better. For a further improvement we should let the previous generation of paintings judge the new one.
It's widely accepted because it's cheap, but LLMs aren't really good judges.

It's supposed to leverage a "generate vs. critique" gap in skill level as a form of self-improvement. It's easier to judge how good food is than to make it.

But here's the thing. When it comes to code review, you need to be effectively as skilled as the person who wrote it. There isn't really a gap.

And then the real clincher is this. LLMs naturally have a skill gap between their judgement and generation skills as is. The reason is that they have superhuman pattern matching and memorization ability. They can use their memorized patterns as a massive crutch for their actual reasoning skills, but they can't do the same for judgement calls in code review.

Accepted by whom, the people shoving AI down our throats?
Shouldn't one review, say, a random 1% of the ratings to ensure the judge is performing as expected?
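
For what it's worth, a spot check like that is cheap to sketch. This is only a rough illustration, not anyone's actual eval pipeline: the function names (`sample_for_audit`, `agreement_rate`), the dict-of-ratings layout, and the "pass"/"fail" labels are all made up for the example. It draws a reproducible random 1% of the judge's ratings for a human to re-grade, then reports how often the judge and the human agree.

```python
import random

def sample_for_audit(ratings, fraction=0.01, seed=0):
    """Pick a reproducible random subset of item ids for human review.

    `ratings` is a hypothetical dict mapping item id -> judge rating.
    """
    rng = random.Random(seed)
    # At least one item when there is anything to sample, never more than exist.
    k = min(len(ratings), max(1, round(len(ratings) * fraction)))
    return rng.sample(sorted(ratings), k)

def agreement_rate(judge_ratings, human_ratings):
    """Fraction of audited items where judge and human gave the same rating."""
    shared = [i for i in human_ratings if i in judge_ratings]
    if not shared:
        raise ValueError("no overlapping items to compare")
    matches = sum(judge_ratings[i] == human_ratings[i] for i in shared)
    return matches / len(shared)

if __name__ == "__main__":
    # Fake judge output: 1000 items rated "pass" or "fail".
    judge = {i: ("pass" if i % 2 == 0 else "fail") for i in range(1000)}
    audit_ids = sample_for_audit(judge)
    # In practice a human would fill these in; here they are copied as a placeholder.
    human = {i: judge[i] for i in audit_ids}
    print(len(audit_ids), agreement_rate(judge, human))
```

With only 1% of a small run the agreement number is a rough estimate, so a low score is a reason to look closer rather than a precise measurement.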