by krick 694 days ago
> Hidden

It's pretty absurd that this can be a criterion for a good benchmark. You should dismiss any paper (e.g. a benchmark result) that isn't repeatable, and by definition a closed test dataset is not repeatable. In fact it doesn't provide much insight anyway: you might as well call it an arbitrary curated rating, like the ones useless journalists produce ("top 50 most influential women of all time").

Obviously, this contradicts the requirement that the test set shouldn't be a subset of the training set. It is kinda reasonable to assume that if you can access data on the internet, OpenAI can too. The only way around that is if everyone agrees to respect at least one robots.txt file on the internet, and even then somebody can just repost the content elsewhere.

I don't have a solution. I'm just saying this is complete bullshit, the idea that we are starting to exclaim "yay, hidden data! I can trust that!" just cannot be acceptable.

1 comment

One way to handle this might be to have the data hidden, but verifiable in the future. That is: publish a signed hash of the benchmark questions, and every X amount of time swap them out and publish the old ones.
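
A rough sketch of the commit-then-reveal part in Python, just to make the idea concrete. Everything here is an assumption on my part: the function names `commit` and `verify` are made up, the salt is there so short question sets can't be guessed by brute force, and I'm leaving out the actual signature (you'd sign the published digest with whatever key the benchmark maintainers already use).

```python
import hashlib
import json
import secrets


def commit(questions, salt=None):
    """Return (digest, salt) for a list of question strings.

    Hypothetical helper: the digest is what you publish now,
    the questions and salt stay private until the reveal.
    """
    if salt is None:
        salt = secrets.token_hex(16)
    # Canonical JSON so the same questions always serialize to the same bytes.
    payload = json.dumps(
        {"salt": salt, "questions": questions},
        sort_keys=True,
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest(), salt


def verify(questions, salt, published_digest):
    """Check revealed questions and salt against the earlier published digest."""
    digest, _ = commit(questions, salt)
    return secrets.compare_digest(digest, published_digest)
```

When the set is rotated out, you publish the old questions together with the salt, and anyone can call `verify` to confirm they are the same ones that were committed to at the start. This only proves the questions weren't changed after the fact; it says nothing about whether they leaked in the meantime.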

> I don't have a solution. I'm just saying this is complete bullshit, the idea that we are starting to exclaim "yay, hidden data! I can trust that!" just cannot be acceptable.

This just feels unkind. If you acknowledge that it's a hard (impossible?) problem, you should give some leeway to people doing their best until a consensus on the right approach(es) exists.