| HN Mirror


	by jmye 111 days ago
	> I'm not sure how groundbreaking the main insight is.

	I think it likely is groundbreaking for a number of people (especially non-tech CTOs and VPs) who make decisions based on these benchmarks and who have never wondered what the scores are actually scoring.

mzelling 111 days ago

I'm not sure if the paper's findings are all that actionable. The paper doesn't say "here's how benchmarks are currently being gamed." It says "here's how benchmarks could in theory be gamed."

Whether benchmark results are misleading depends more on the reporting organization than on the benchmark. Integrity and competence play large roles in this. When OpenAI reports a benchmark number, I trust it more than when that same number is reported by a couple of Stanford undergrads posting "we achieved SOTA on XYZ benchmark" all over Twitter.

jmye 111 days ago

I think that’s totally fair!

I guess I look at this less as an “ah ha! They’re all cheating!” and more as a “were you guys even aware of what the benchmarks represented and how they checked them?”

mzelling 110 days ago

That's a great way to look at it. The paper is a reality check for anyone who thinks of benchmarks as these monolithic, oracular judges of performance. It highlights the soft underbelly of benchmarking.

lukev 110 days ago

Did you read the article? There's a whole section on "this is already happening."

mzelling 110 days ago

Yes, I did see that section. We've known for a while that reward hacking, train/test data contamination, etc. must be taken seriously. Researchers are actively guarding against these problems. This paper explores what happens when researchers flip their stance and actively try to reward hack — how far can they push it? The answer is "very far."
