Evals will break | HN Mirror

	Evals will break (wanglun1996.github.io)
	25 points by rajveerb 30 days ago

4 comments

ppeetteerr 30 days ago

The argument in the article is backwards. Evals test the stability and boundaries of a concept. They are not created before the concept has been prototyped (which the author acknowledges).

An eval is not somehow breaking silently due to some new capabilities in an LLM. It wouldn't be a good eval if it did. What it does is steer the LLM towards specific goals. If anything, an argument can be made that they restrict creativity and experimentation by narrowing goals.

If the argument is that evals need to be written before some new behavior can be devised, that's incorrect. There are an infinite number of evals that test for things which cannot be done. Only when something has been demonstrated to work in a specific context can an eval be written.

rajveerb 29 days ago

Most of these are addressed?

ppeetteerr 28 days ago

They are addressed but the core of the thesis is still wrong:

> This is the core problem: our entire evaluation infrastructure is structurally reactive. We measure the system after it has changed. We never predict the change.

That's kind of the point of evals.

rajveerb 30 days ago

I read through this blog post and it's timely given how close the models are to maxing out the benchmarks/evals.

One thing that was not addressed, but would be interesting to discuss, is benchmarks/evals that conflict.

Are there desirable emergent behaviors that might not be optimized because the evals penalize them?

cowang 30 days ago

AI slop

satisfice 30 days ago

“Eval” is not testing. This post is written as if no one ever heard of a thing called testing.

For the uninitiated: if you think testing is nothing more than simple operations and assertions, then you don’t know anything important about testing.

satisfice 29 days ago

Look at that. Downvoted. Probably by no one who has ever studied testing.
