Hacker News new | ask | show | jobs
by dodslaser 581 days ago
What else should you train for? If the benchmark dosn't represent real world scenarios, isn't that a problem with the benchmark rather than the model?
4 comments

If your benchmark covers all possible programming tasks then you dont need an llm, you need search over your benchmark.

Hypothetically let's say the benchmark contains "test divisibility of this integer by n" for all n of the form 3x+1. An extremely overfit llm won't be able to code divisibility for all n not of the form 3x+1, and your benchmark will never tell.

No, because solving a well defined problem with well defined right or wrong is generally not what people use llm for. Most of the times my query to llm is underspecified, and lot of time I figure out the problem when chatting with LLM. And benchmark by definition only measures just right/wrong answer.
This is called Goodhart's law, who said: "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."

But in modern usage it is often rephrased to: "When a measure becomes a target, it ceases to be a good measure"

https://en.wikipedia.org/wiki/Goodhart%27s_law

Overfitting is a concern.