by zhisbug 1181 days ago
But how can you guarantee the pretrained model hasn't seen those benchmarks? And that the baselines you are comparing to (ChatGPT, Bard) haven't either? Those benchmark datasets are also collected from the Internet, right?
1 comment

As a baseline, you don't! If the model performs horribly on the test even though it cheated (it saw the benchmark during training), that's even worse than failing the test without cheating. Contamination should only inflate the score, so the benchmark score gives you an upper bound on performance.
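
To put rough numbers on the upper-bound argument, here is a minimal sketch. It rests on two assumptions that are mine, not measured: leaked items never hurt the model (it does at least as well on them as on unseen items), and in the worst case it answers every leaked item correctly. The function name `clean_accuracy_bounds` and the `leaked_fraction` parameter are hypothetical, and in practice you rarely know the leaked fraction.

```python
def clean_accuracy_bounds(observed: float, leaked_fraction: float) -> tuple[float, float]:
    """Bound accuracy on unseen items, given observed benchmark accuracy.

    Model: observed = c * acc_leaked + (1 - c) * acc_clean, where c is the
    (assumed) fraction of benchmark items the model saw during training.
    """
    if not 0.0 <= observed <= 1.0:
        raise ValueError("observed must be in [0, 1]")
    if not 0.0 <= leaked_fraction < 1.0:
        raise ValueError("leaked_fraction must be in [0, 1)")

    # Worst case: every leaked item is answered correctly (acc_leaked = 1),
    # so the remaining correct answers are all that the clean items earned.
    lower = max(0.0, (observed - leaked_fraction) / (1.0 - leaked_fraction))

    # Assumption: leakage never hurts (acc_leaked >= acc_clean), so the
    # observed score cannot be below the clean score.
    upper = observed
    return lower, upper


if __name__ == "__main__":
    # 80% observed, half the benchmark assumed leaked.
    print(clean_accuracy_bounds(0.8, 0.5))
```

With no leakage the two bounds coincide at the observed score. The more of the benchmark you assume leaked, the less a high score tells you, but a low score stays damning either way.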