HN Mirror


by light_hue_1 546 days ago

It's possible.

I'm not worried about cheaters. We just need to lay out clear rules. You cannot look at the inputs or outputs in any way. You cannot log them or record them for future use, whether manually or automatically.

If someone cheats, they will be found out. Their contribution won't stand the test of time: no one will replicate those results with their method. And their performance on the datasets they cheated on will be astronomical compared to everything else.

FrontierMath is a great example of a failure in this space. By going closed, instead of using a license, they've created massive confusion. At first they told us that the benchmark was incredibly hard. And they showed reviewers subsets that were hard. Now, they're telling us that actually, 25% of the questions are easy. And 50% of the questions are pretty hard. But only a small fraction are what the reviewers saw.

Closed datasets aren't the answer. They're just unscientific nonsense. I refuse to even consider running on them.

We need test sets that are open for scrutiny. With licenses that prevent abuse. We can be very creative about the license. Like, you can only evaluate on this dataset once, and must preregister your evaluations.

1 comment

upghost 546 days ago

I would like to agree with you, but I doubt the honor system will work here. We are talking about companies that have blatantly trampled (or are willing to risk a judicial confrontation over trampling) copyright. It would be unreasonable to assume they would not engage in the same behavior with benchmarks and test sets, especially with the amount of money on the line for the winners.
