by upghost
546 days ago
I don't know how this is possible with LLM tests. Providers of closed-source models get access to at least the questions whenever the questions are sent over the fence via API. This gives closed-source models an enormous advantage over open-source models. The FrontierMath dataset has this same problem[1]. It's a shame, because creating these benchmarks is time-consuming and expensive. I don't know of a way to fix this, except perhaps partially by using reward models to evaluate results on random questions instead of using datasets, but there would be a lot of reproducibility problems with that. Still, not sure how to overcome this.
[1]: https://news.ycombinator.com/item?id=42494217
|
I'm not worried about cheaters. We just need to lay out clear rules. You cannot look at the inputs or outputs in any way. You cannot log them. You cannot record them for future use, either manually or in an automated way.
If someone cheats, they will be found out. Their contribution won't stand the test of time: no one will replicate those results with their method, and their performance on the datasets they cheated on will be astronomical compared to everything else.
FrontierMath is a great example of a failure in this space. By going closed, instead of using a license, they've created massive confusion. At first they told us that the benchmark was incredibly hard. And they showed reviewers subsets that were hard. Now, they're telling us that actually, 25% of the questions are easy. And 50% of the questions are pretty hard. But only a small fraction are what the reviewers saw.
Closed datasets aren't the answer. They're just unscientific nonsense. I refuse to even consider running on them.
We need test sets that are open for scrutiny. With licenses that prevent abuse. We can be very creative about the license. Like, you can only evaluate on this dataset once, and must preregister your evaluations.
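To make the "evaluate once, and preregister" idea concrete, here is a rough sketch of how such a rule could be checked mechanically. This is only an illustration, not an existing tool or a real license mechanism: the `PreregistrationLedger` class, its method names, and the plan fields are all hypothetical. It assumes an evaluator commits to a plan by its SHA-256 hash before running, and that each (evaluator, dataset) pair gets exactly one run.

```python
import hashlib
import json


class PreregistrationLedger:
    """Hypothetical ledger: one preregistered run per (evaluator, dataset)."""

    def __init__(self):
        self._plans = {}    # (evaluator, dataset) -> hash of the committed plan
        self._used = set()  # pairs that have already spent their single run

    @staticmethod
    def plan_hash(plan):
        # Canonical JSON so the same plan always produces the same hash.
        canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def preregister(self, evaluator, dataset, plan):
        """Commit to an evaluation plan before any run takes place."""
        key = (evaluator, dataset)
        if key in self._plans:
            raise ValueError("this evaluator already preregistered for this dataset")
        self._plans[key] = self.plan_hash(plan)
        return self._plans[key]

    def authorize_run(self, evaluator, dataset, plan):
        """Allow a run only if it matches the committed plan and is the first."""
        key = (evaluator, dataset)
        if key not in self._plans or key in self._used:
            return False
        if self.plan_hash(plan) != self._plans[key]:
            return False  # the plan changed after preregistration
        self._used.add(key)
        return True
```

Usage would look like: call `preregister("lab-a", "benchmark-x", plan)` up front, then `authorize_run` with the same plan returns `True` once and `False` on any later attempt or with an altered plan. A ledger like this only enforces the bookkeeping; it does nothing to stop someone from looking at the data outside of it, which is what the license is for.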