by 0xab 546 days ago
Datasets need to stop shipping with any training sets at all! And their licenses should forbid anyone from using the test set to update the parameters of any model.

We did this with ObjectNet (https://objectnet.dev/) years ago. It's only a test set, no training set provided at all. Back then it was very controversial and we were given a hard time for it initially. Now it's more accepted. Time to make this idea mainstream.

No more training sets. Everything should be out of domain.

2 comments

I don't know how this is possible with LLM tests. Closed-source models will get access to at least the questions as soon as the questions are sent over the fence via an API.

This gives closed-source models an enormous advantage over open-source models.

The FrontierMath dataset has this same problem[1].

It's a shame because creating these benchmarks is time-consuming and expensive.

I don't know of a way to fix this except perhaps partially by using reward models to evaluate results on random questions instead of using datasets, but there would be a lot of reproducibility problems with that.

Still -- not sure how to overcome this.

[1]: https://news.ycombinator.com/item?id=42494217

It's possible.

I'm not worried about cheaters. We just need to lay out clear rules. You cannot look at the inputs or outputs in any way. You cannot log them. You cannot record them for future use, either manually or in an automated way.

If someone cheats, they will be found out. Their contribution won't stand the test of time, no one will replicate those results with their method. And their performance on datasets that they cheated on will be astronomical compared to everything else.
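To make the "astronomical compared to everything else" check concrete, here is a minimal sketch of how one might flag it. This is an illustration only, not a method anyone in this thread has proposed: the function names, the `gap` threshold, and all scores below are invented, and a real analysis would need far more models and benchmarks.

```python
from statistics import mean, median, pstdev

def relative_standing(scores, model, benchmark):
    """z-score of `model` on `benchmark`, measured against all other models."""
    others = [s[benchmark] for m, s in scores.items() if m != model]
    mu, sigma = mean(others), pstdev(others)
    if sigma == 0:
        return 0.0  # no spread among the other models, nothing to compare against
    return (scores[model][benchmark] - mu) / sigma

def suspicious_benchmarks(scores, model, gap=3.0):
    """Benchmarks where the model's standing beats its median standing
    on the remaining benchmarks by more than `gap` (a hypothetical threshold)."""
    benchmarks = list(scores[model])
    z = {b: relative_standing(scores, model, b) for b in benchmarks}
    flagged = []
    for b in benchmarks:
        rest = [z[o] for o in benchmarks if o != b]
        if rest and z[b] - median(rest) > gap:
            flagged.append(b)
    return flagged

# Invented scores, purely for illustration.
scores = {
    "model_a": {"bench_x": 0.60, "bench_y": 0.55, "bench_z": 0.50},
    "model_b": {"bench_x": 0.62, "bench_y": 0.57, "bench_z": 0.52},
    "model_c": {"bench_x": 0.58, "bench_y": 0.53, "bench_z": 0.48},
    "model_d": {"bench_x": 0.61, "bench_y": 0.56, "bench_z": 0.95},
}
print(suspicious_benchmarks(scores, "model_d"))
```

With these made-up numbers, only model_d's result on bench_z is flagged. A flag like this is a reason to look closer, not proof of cheating.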

FrontierMath is a great example of a failure in this space. By going closed, instead of using a license, they've created massive confusion. At first they told us that the benchmark was incredibly hard. And they showed reviewers subsets that were hard. Now, they're telling us that actually, 25% of the questions are easy. And 50% of the questions are pretty hard. But only a small fraction are what the reviewers saw.

Closed datasets aren't the answer. They're just unscientific nonsense. I refuse to even consider running on them.

We need test sets that are open for scrutiny. With licenses that prevent abuse. We can be very creative about the license. Like, you can only evaluate on this dataset once, and must preregister your evaluations.
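As a rough sketch of what "evaluate once, and preregister" could look like mechanically: publish a hash commitment to the evaluation plan before running, and have a registry refuse a second registration for the same dataset. Everything here is an assumption for illustration (the `Registry` class, the record fields, the nonce scheme); it is not how ObjectNet or any existing license works.

```python
import hashlib
import json

def commitment(record: dict, nonce: str) -> str:
    """SHA-256 digest of a canonical JSON encoding of the plan plus a secret nonce."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")) + nonce
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def verify(record: dict, nonce: str, published_digest: str) -> bool:
    """Check a revealed plan and nonce against the digest published earlier."""
    return commitment(record, nonce) == published_digest

class Registry:
    """Hypothetical registry: one preregistered evaluation per (evaluator, dataset)."""

    def __init__(self):
        self._digests = {}

    def preregister(self, evaluator: str, dataset: str, digest: str) -> None:
        key = (evaluator, dataset)
        if key in self._digests:
            raise ValueError(f"{evaluator} already registered an evaluation on {dataset}")
        self._digests[key] = digest

    def digest_for(self, evaluator: str, dataset: str) -> str:
        return self._digests[(evaluator, dataset)]

# Hypothetical plan: the field names and values are made up.
plan = {"model": "example-model-v1", "dataset": "example-test-set", "metric": "top1"}
registry = Registry()
registry.preregister("lab_a", "example-test-set", commitment(plan, nonce="s3cret"))
```

After the evaluation, the lab reveals the plan and nonce, and anyone can call `verify` against the published digest. This only shows that the plan was fixed in advance; it cannot show that nobody looked at the test inputs, which is still up to the license and the honor system.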

I would like to agree with you, but I doubt the honor system will work here. We are talking about companies that have blatantly trampled (or are willing to risk a judicial confrontation about trampling) copyright. It would be unreasonable to assume they would not engage in the same behavior about benchmarks and test sets, especially with the amount of money on the line for the winners.

I understand the idea but I don't think that it is beneficial in the end.

Access to the dataset is needed to understand why we get a given result. First, for transparency: to check whether the results make sense and why one model is favored over another.

It is also needed to understand why a model performs badly in some respect, so that we can determine how to improve it.