| HN Mirror



	by andy99 793 days ago
	Once benchmarks have existed for a while, they become meaningless. Even if nobody is specifically training on the test set, researchers' choices (what used to be called "graduate student descent") end up steering new models towards overfitting on benchmark tasks.

3 comments

acchow 793 days ago

Also, the technological leader focuses less on the benchmarks.


manmal 793 days ago

Interesting claim. Is there data to back this up? My impression is that Intel and NVIDIA have always gamed the benchmarks.


jgalt212 793 days ago

NVIDIA needs T (trillion-parameter) models, not B (billion-parameter) models, to keep the share price up.


karmasimida 793 days ago

Even the random seed can cause a big shift in HumanEval performance; if you know, you know. It is perfectly legal to choose the one checkpoint that looks best on those benchmarks and move along.

HumanEval is meaningless regardless; those 164 problems have been overfit to a tee.

Hook this up to LLM arena and we will get a better picture of how powerful they really are.
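
To make the seed/checkpoint point concrete, here is a rough sketch of the selection effect. It is not anyone's actual eval setup. It assumes every checkpoint has the same hypothetical "true" pass rate (0.30), that the 164 problems are independent, and that 20 checkpoints are tried; all of those numbers except 164 are made up for illustration.

```python
import random
import statistics

N_PROBLEMS = 164        # HumanEval size, as mentioned in the comment above
TRUE_PASS_RATE = 0.30   # hypothetical: identical real ability for every checkpoint
N_CHECKPOINTS = 20      # hypothetical: number of seeds/checkpoints someone tries
N_TRIALS = 500          # repeat the whole experiment to average out noise


def benchmark_score(rng):
    """Score one checkpoint: each problem passes independently (an assumption)."""
    passed = sum(rng.random() < TRUE_PASS_RATE for _ in range(N_PROBLEMS))
    return passed / N_PROBLEMS


def simulate(seed=0):
    """Return (mean score of one arbitrary checkpoint, mean score of the best one)."""
    rng = random.Random(seed)
    single, best = [], []
    for _ in range(N_TRIALS):
        scores = [benchmark_score(rng) for _ in range(N_CHECKPOINTS)]
        single.append(scores[0])   # what you would report with no cherry-picking
        best.append(max(scores))   # what you report after picking the best-looking ckpt
    return statistics.mean(single), statistics.mean(best)


if __name__ == "__main__":
    mean_single, mean_best = simulate()
    print(f"true pass rate:        {TRUE_PASS_RATE:.3f}")
    print(f"one checkpoint (mean): {mean_single:.3f}")
    print(f"best of {N_CHECKPOINTS} (mean):     {mean_best:.3f}")
```

Under these assumptions a single run has a standard deviation of about 3.6 percentage points (sqrt(0.3 * 0.7 / 164)), so picking the best of many equally good checkpoints reports a noticeably higher number without any real improvement.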


bilbo0s 793 days ago

"graduate student descent"

Ahhh that takes me back!
