Hacker News
by andy99 793 days ago
Once benchmarks have existed for a while, they become meaningless - even if nobody is specifically training on the test set, the accumulated tuning choices (what used to be called "graduate student descent") end up optimizing new models towards overfitting on benchmark tasks.
3 comments

Also, the technological leader focuses less on the benchmarks
Interesting claim, is there data to back this up? My impression is that Intel and NVIDIA have always gamed the benchmarks.
NVIDIA needs T models not B models to keep the share price up.
Even the random seed can cause a big shift in HumanEval performance, if you know, you know. It is perfectly legal to pick the one ckpt that looks best on those benchmarks and move along.
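
To put a rough number on the seed point, here is a minimal simulation sketch, not anyone's actual eval setup. It assumes every checkpoint has the same true pass rate and that the 164 HumanEval problems are solved independently; the 0.5 pass rate, the 10 checkpoints, and the function names are made up for illustration. Under those assumptions a single score already has a standard error of about 4 points, so reporting the best of several checkpoints inflates the number without any real improvement.

```python
import random


def simulate_scores(true_pass_rate, n_problems, n_checkpoints, rng):
    """Score n_checkpoints models that all share the same true pass rate.

    Each problem is an independent coin flip (an assumption, real problems
    are correlated), so any spread between scores is pure noise.
    """
    scores = []
    for _ in range(n_checkpoints):
        solved = sum(rng.random() < true_pass_rate for _ in range(n_problems))
        scores.append(solved / n_problems)
    return scores


def selection_gap(true_pass_rate=0.5, n_problems=164, n_checkpoints=10,
                  n_trials=500, seed=0):
    """Average of (best checkpoint score - mean checkpoint score) over trials.

    A positive value is the inflation you get from cherry-picking the best
    checkpoint, even though no checkpoint is actually better.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        scores = simulate_scores(true_pass_rate, n_problems, n_checkpoints, rng)
        total += max(scores) - sum(scores) / len(scores)
    return total / n_trials


if __name__ == "__main__":
    gap = selection_gap()
    print(f"best-of-10 inflation: {100 * gap:.1f} points")
```

The gap grows with the number of checkpoints you get to pick from and shrinks with the number of problems, which is why a 164-problem benchmark is so easy to move by selection alone.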

HumanEval is meaningless regardless, those 164 problems have been overfit to a tee.

Hook this up to LLM arena and we will get a better picture of how powerful they really are.

"graduate student descent"

Ahhh that takes me back!