by timbilt
310 days ago
Yes, but in a case like this it's a neutral third party running the benchmark, so there isn't a direct incentive for them to favor one lab over another. With public benchmarks we're trusting the labs not to cheat. And it's easy to "cheat" accidentally: they actually need to make a serious effort not to contaminate the training data. And there are massive incentives for the labs to cheat in order to get the hype going around their launch and justify their massive investments in training. It doesn't have to be the CEO who's directing it. It can even be one or a few researchers who are responsible for a specific area of model performance and are under tremendous pressure to deliver.
|
There's a long history of that sort of behaviour: ISPs gaming bandwidth tests when they detect one is being run, software recognizing that it's being run in a VM or on a particular configuration. I don't think it's a stretch to assume some of the money at OpenAI and others has gone into spotting likely benchmark queries and throwing on a little more compute or tagging them for future training.
I would be outright shocked if most of these benchmarks were even attempting serious countermeasures.