by krick
694 days ago
> Hidden

It's pretty absurd that this can be a criterion for a good benchmark. You should dismiss any paper (e.g. a benchmark result) that isn't repeatable, and by definition a closed test dataset is not repeatable. In fact, it doesn't provide much insight anyway; you might as well call it an arbitrary curated rating, like the ones useless journalists produce ("top 50 most influential women of all time"). Obviously, this contradicts the requirement that the test set shouldn't be a subset of the training set. It is kind of reasonable to assume that if you can access data on the internet, OpenAI can too. Unless everyone agrees to respect at least one robots.txt file on the internet, and even then somebody can just repost something. I don't have a solution. I'm just saying this is complete bullshit: the idea that we are starting to exclaim "yay, hidden data! I can trust that!" just cannot be acceptable.
> I don't have a solution. I'm just saying this is complete bullshit, the idea that we are starting to exclaim "yay, hidden data! I can trust that!" just cannot be acceptable.
This just feels unkind. If you acknowledge that it's a hard (impossible?) problem, you should give some leeway to people doing their best until a consensus on the right approach(es) exists.