| HN Mirror



	by diggan 513 days ago
	> Tamay from Epoch AI here. We made a mistake in not being more transparent about OpenAI's involvement. We were restricted from disclosing the partnership until around the time o3 launched, and in hindsight we should have negotiated harder for the ability to be transparent to the benchmark contributors as soon as possible. Our contract specifically prevented us from disclosing information about the funding source and the fact that OpenAI has data access to much but not all of the dataset.

	Not sure if "integrity of the benchmarks" should even be something that you negotiate over. What's the value of the benchmark if the results cannot be trusted because of undisclosed relationships and sharing of data? Why would they be restricted from disclosing stuff you normally disclose, and how does that not raise all sorts of warning flags as soon as it's proposed?

2 comments

optimalsolver 513 days ago

>OpenAI has data access to much but not all of the dataset

Their head mathematician says they have the full dataset, except a holdout set which they're currently developing (i.e. doesn't exist yet):

https://www.reddit.com/r/singularity/comments/1i4n0r5/commen...

menaerus 512 days ago

Thanks for the link. So the holdout set has yet to be used to verify the 25% claim. He also says he doesn't believe OpenAI would sabotage themselves by gaming their internal benchmark results, since that would be easily exposed, either by the results on a holdout set or by the public repeating the benchmarks themselves. Seems reasonable to me.

optimalsolver 512 days ago

>the public repeating the benchmarks themselves

The public has no access to this benchmark.

In fact, everyone thought it was all locked up in a vault at Epoch AI HQ, but looks like Sam Altman has a copy on his bedside table.

menaerus 512 days ago

Perhaps what he meant is that the public will be able to benchmark the model themselves by throwing math problems of varying difficulty at it, not necessarily the FrontierMath benchmark. It should become pretty obvious whether they were faking the results or not.

optimalsolver 512 days ago

It's been found [0] that slightly varying Putnam problems causes a 30% drop in o1-Preview accuracy, but that hasn't put a dent in OAI's hype.

There's absolutely no comeuppance for juicing benchmarks, especially ones no one has access to. If performance of o3 doesn't meet expectations, there'll be plenty of people making excuses for it ("You're prompting it wrong!", "That's just not its domain!").

[0] https://openreview.net/forum?id=YXnwlZe0yf&noteId=yrsGpHd0Sf

menaerus 512 days ago

> If performance of o3 doesn't meet expectations, there'll be plenty of people making excuses for it

I agree, and I can definitely see that happening, but given the incentives and the impact of this technology, it is also not impossible for some other company or community to create another, perhaps FrontierMath-like, benchmark to cross-validate the results.

I also agree that it is not impossible for OpenAI to have faked these results. Time will tell.

aunty_helen 513 days ago

This feels like a done deal. This benchmark should be discarded.
