by timbilt 310 days ago
> Unlike many public benchmarks, the PR Benchmark is private, and its data is not publicly released. This ensures models haven’t seen it during training, making results fairer and more indicative of real-world generalization.

This is key.

Public benchmarks are essentially trust-based and the trust just isn't there.

3 comments

Unless you're running the LLM yourself (locally), private benchmarks are also trust-based, aren't they?
Yes, but in a case like this it's a neutral third-party running the benchmark. So there isn't a direct incentive for them to favor one lab over another.

With public benchmarks we're trusting the labs not to cheat. And it's easy to "cheat" accidentally - they actually need to make a serious effort to not contaminate the training data.

And there are massive incentives for the labs to cheat in order to get the hype going around their launch and justify their massive investments in training. It doesn't have to be the CEO who's directing it. It can even be one or a few researchers who are responsible for a specific area of model performance and are under tremendous pressure to deliver.

The problem is that when using a model hosted by those labs (e.g., OpenAI only allowed access to o3 through their own direct API, not even Azure), there is still a significant risk of cheating.

There's a long history of that sort of behaviour. ISPs gaming bandwidth tests when they detect one is being run. Software detecting that it's running in a VM or on a particular configuration. I don't think it's a stretch to assume some of the money at OpenAI and others has gone into spotting likely benchmark queries and throwing on a little more compute or tagging them for future training.

I would be outright shocked if most of these benchmarks are even attempting serious countermeasures.

How does this ensure models haven’t seen it during training - is it a different benchmark per model release?
Then you just need to use different data the next time you evaluate. That is much more indicative of real-world generalization: after all, you don't normally do multiple PRs on the same pieces of code. The current approach risks leaking the dataset selectively and/or fudging the results, because they can't be verified. Transparency is key for this kind of benchmark. As it stands, we have to trust the entity doing the benchmarking instead of relying on independent verification of the results, and with the amount of money at stake here I don't think that's the way to go.