HN Mirror



	by mikkelam 134 days ago
	Why are we treating LLM evaluation like a vibe check rather than an engineering problem? Most "Model X > Model Y" takes on HN these days (and everywhere) seem based on an hour of unscientific manual prompting. Are we actually running rigorous, version-controlled evals, or just making architectural decisions based on whether a model nailed a regex on the first try this morning?

4 comments

ainch 134 days ago

I don't think it's just an engineering problem - decades of research have failed to produce a convincing, general definition of intelligence, capability or agency. You can try to form proxy metrics by combining benchmarks, but existing benchmarks are flawed, and should be taken with a pinch of salt.

It's evident in the fact that every time AI has historically met certain thresholds (chess-playing, the Turing Test, fluent language), we play with them a little more and find out there's something still lacking.


tanaros 134 days ago

Whenever somebody makes a benchmark, people complain that the benchmark results are meaningless because they’re gamed. I don’t know why those same people don’t understand that grading on vibes is strictly worse.


tintor 134 days ago

Depends on the benchmark.

If the questions are fixed, they are trivial to game.
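One way to picture the alternative to a fixed question set is a benchmark whose questions are generated from a seed, so a model that memorized published instances gets no credit on fresh ones. The sketch below is only an illustration under assumptions: `make_question`, `score`, and the toy multiplication task are hypothetical names and choices, not something proposed in this thread.

```python
import random
import re
from typing import Callable, Iterable, Tuple


def make_question(seed: int) -> Tuple[str, str]:
    """Build one question/answer pair deterministically from a seed (toy task)."""
    rng = random.Random(seed)
    a = rng.randint(100, 999)
    b = rng.randint(100, 999)
    return f"What is {a} * {b}?", str(a * b)


def score(model: Callable[[str], str], seeds: Iterable[int]) -> float:
    """Fraction of generated questions the model answers exactly right."""
    seeds = list(seeds)
    correct = 0
    for seed in seeds:
        question, expected = make_question(seed)
        if model(question).strip() == expected:
            correct += 1
    return correct / len(seeds) if seeds else 0.0


# A "model" that only memorized the published instances (seeds 0..99).
published = dict(make_question(s) for s in range(100))


def memorizer(question: str) -> str:
    return published.get(question, "unknown")


# A "model" that actually does the task.
def solver(question: str) -> str:
    a, b = (int(n) for n in re.findall(r"\d+", question))
    return str(a * b)


seen_score = score(memorizer, range(100))            # published seeds
fresh_score = score(memorizer, range(1000, 1100))    # unpublished seeds
solver_fresh_score = score(solver, range(1000, 1100))
```

Comparing `seen_score` with `fresh_score` is the point: a gap between the two suggests the model fit the published instances rather than the task.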


pizza 134 days ago

There’s a Dark Forest problem for evals. As soon as they’re made public, they start running out of time to be useful. It’s also not clear how to predict a model’s performance on a task from an eval, or even whether a model that does well on two skills individually in the evals still does well on their composition. At this point it might be better to be scientific about unscientific approaches than to attribute more predictive power to relatively weakly predictive evals than they actually have.


bisonbear 134 days ago

I agree with your analysis but not the conclusion.

Evals are broken - OpenAI showed that SWE-bench Verified was in the training data, and that models were able to reconstruct the changes from memory (https://openai.com/index/why-we-no-longer-evaluate-swe-bench...).

However, this doesn't mean we should completely give up on benchmarking. In fact, as models get more intelligent, and we give them more autonomy, I believe that tracking agent alignment to your coding standards becomes even more important.

What I've been exploring is making a benchmark that is unique per-repo, answering the question: how does the coding agent perform in my repo, doing my tasks, with my context? No longer do we have to trust general benchmarks.

Of course there will still be difficulties and limitations, but it's a step towards giving devs more information about agent performance, and allowing them to use that information to tweak and optimize the agent further.
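To make the per-repo idea concrete, here is a minimal sketch of what such a harness could look like. Everything in it is an assumption for illustration: `RepoTask`, `run_repo_eval`, `pass_rate`, the string-based checks, and the stub agent are hypothetical, and a real setup would more likely apply the agent's patch and run the repo's own tests and linters.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RepoTask:
    """One repo-specific task plus a check encoding a local standard."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the agent's output is acceptable


def run_repo_eval(agent: Callable[[str], str], tasks: List[RepoTask]) -> Dict[str, bool]:
    """Run the agent on every task; a crash in the agent or check counts as a failure."""
    results: Dict[str, bool] = {}
    for task in tasks:
        try:
            results[task.name] = bool(task.check(agent(task.prompt)))
        except Exception:
            results[task.name] = False
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Share of tasks that passed (0.0 for an empty result set)."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical stand-in for a real coding agent: always returns the same snippet.
def stub_agent(prompt: str) -> str:
    return "def slugify(title: str) -> str:\n    return title.lower().replace(' ', '-')\n"


PROMPT = "Write slugify(title) following our style guide."
tasks = [
    RepoTask("uses-type-hints", PROMPT, lambda out: "-> str" in out),
    RepoTask("no-print-debugging", PROMPT, lambda out: "print(" not in out),
    RepoTask("has-docstring", PROMPT, lambda out: '"""' in out),
]

results = run_repo_eval(stub_agent, tasks)
rate = pass_rate(results)
```

Keeping the task list in the repo itself means the benchmark is versioned alongside the code and standards it measures, and the per-task results show which conventions the agent misses rather than only a single score.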


H8crilA 134 days ago

Someone else already wrote it, but it's just too funny to not abuse:

Evals are bad because people learn and fit to them. So we do extremely small evals instead.


xandrius 134 days ago

Is "Dark Forest problem" an actual name? I just heard of the hypothesis and it has nothing to do with how you used it in this context.


pizza 134 days ago

I meant it in this sense: you have benchmarkers and trainers. If you publicize your evaluation, trainers will likely have their models 'consume' it, even if only indirectly: another person creating their own benchmark from scratch may be influenced by yours, even if the new question sets are clean-room. That, and the rule of thumb that benchmark value dissipates like sqrt(age) [0].

So there is a definite advantage to never publicizing your internal benchmark. But then no one else can replicate your findings. You should assume that the space of benchmarks that are actually decent at evaluating model performance is much larger than what is public, that most of the good ones (the ones that were costliest to produce) are hidden, and that they might not even correspond very well with the public ones. And that the expensive public benchmarks are selective and biased towards marketing purposes.

[0] https://www.offconvex.org/2021/04/07/ripvanwinkle/


sebastiennight 134 days ago

I believe the correct term is "Goodhart's Law": https://en.wikipedia.org/wiki/Goodhart%27s_law


Culonavirus 134 days ago

I mean, you vibe check, then you vibe code. Makes perfect sense. (this is a joke)
