
It's possible.

I'm not worried about cheaters. We just need to lay out clear rules. You cannot look at the inputs or outputs in any way. You cannot log them or record them for future use, whether manually or in an automated way.

If someone cheats, they will be found out. Their contribution won't stand the test of time, because no one will be able to replicate those results with their method. And their performance on the datasets they cheated on will be astronomical compared to everything else.

FrontierMath is a great example of a failure in this space. By going closed instead of using a license, they've created massive confusion. At first they told us that the benchmark was incredibly hard, and they showed reviewers subsets that were hard. Now they're telling us that actually, 25% of the questions are easy and 50% of the questions are pretty hard. But only a small fraction are what the reviewers saw.

Closed datasets aren't the answer. They're just unscientific nonsense. I refuse to even consider running on them.

We need test sets that are open for scrutiny, with licenses that prevent abuse. We can be very creative about the license. For example: you can only evaluate on this dataset once, and you must preregister your evaluations.
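
To make the "evaluate once, preregister first" idea concrete, here is a rough sketch of how a registry could enforce it. Everything in it is hypothetical: the `EvalRegistry` class, its method names, and the shape of the plan dictionary are made up for illustration and don't describe any existing benchmark's tooling. The only assumption is that a team commits to a hash of its evaluation plan before it gets a single run.

```python
import hashlib
import json


class EvalRegistry:
    """Toy in-memory registry (hypothetical): one preregistered run per (team, dataset)."""

    def __init__(self):
        self._commitments = {}  # (team, dataset) -> SHA-256 hex digest of the plan
        self._used = set()      # (team, dataset) pairs that already spent their run

    @staticmethod
    def _digest(plan):
        # Canonical JSON so the same plan always hashes to the same value.
        return hashlib.sha256(json.dumps(plan, sort_keys=True).encode("utf-8")).hexdigest()

    def preregister(self, team, dataset, plan):
        """Record a commitment to the plan; only one commitment per (team, dataset)."""
        key = (team, dataset)
        if key in self._commitments:
            raise ValueError("already preregistered")
        self._commitments[key] = self._digest(plan)
        return self._commitments[key]

    def authorize(self, team, dataset, plan):
        """Return True once, and only if the plan matches the preregistered one."""
        key = (team, dataset)
        if key not in self._commitments or key in self._used:
            return False
        if self._digest(plan) != self._commitments[key]:
            return False  # plan changed after preregistration; the run is not consumed
        self._used.add(key)
        return True
```

A second call to `authorize` for the same team and dataset returns `False`, which is the "only once" part. A real version would need a public, timestamped log rather than a dictionary in memory, and the license would be what gives it teeth.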