Hacker News
by rgbrgb 870 days ago
i don’t buy that premise. in practice we’re seeing a lot of evidence that you can’t trust the open evals because of contamination (maybe accidental, though there’s definitely incentive to cheat and move up the leaderboards).

closed/subjective ranking and evaluation has been around since there were critics. yes it’s hard to bootstrap trust, but i can’t see a way around it because the open evals can’t really be trusted either.

1 comment

I find this argument weird. I'm not saying you can trust the open evals, I'm just saying you can know their limits. With closed evals you're a lot more blind.