The problem is that when you use a model hosted by those labs (e.g. OpenAI only allowed access to o3 through its own direct API, not even Azure), there is still a significant risk of cheating.

There's a long history of that sort of behaviour: ISPs gaming bandwidth tests when they detect one is being run, software recognizing that it is running in a VM or on a particular configuration. I don't think it's a stretch to assume some of the money at OpenAI and other labs has gone into spotting likely benchmark queries and throwing a little more compute at them or tagging them for future training.

I would be outright shocked if most of these benchmarks were even attempting serious countermeasures.