

	by zarzavat 510 days ago
	OpenAI played themselves here. Now nobody is going to take any of their results on this benchmark seriously, ever again. That o3 result has just disappeared in a puff of smoke. If they had blinded themselves properly, that wouldn't be the case. Meanwhile, other AI companies now have the opportunity to be first to get a significant result on FrontierMath.

3 comments

colonial 510 days ago

I'd be surprised if any of their in-house benchmark results are taken seriously after this. As an extremely rough estimate, FrontierMath cost five to six figures to assemble [1] - so from an outside view, they clearly have no qualms about turning cash into quasi-guaranteed benchmark results.

[1]: https://epoch.ai/math-problems/submit-problem - the benchmark is composed of "hundreds" of questions, so at the absolute lowest it cost 300 * 200 = 60,000 dollars.

red75prime 510 days ago

Conversely, if they didn't cheat and they funded creation of the test suite to get "clean" problems (while hiding their participation to avoid getting problems that are somehow tailored to be hard for LLMs specifically), then they have no reason to fear that all this looks fishy, as the test results will soon be vindicated when they give wider access to the model.

I refrain from forming a strong opinion in such situations. My intuition tells me that it's not cheating. But, well, it's intuition (probably based on my belief that the brain is nothing special physics-wise and it doesn't manage to realize unknown quantum algorithms in its warm and messy environment, so that classical computers can reproduce all of its feats when having appropriate algorithms and enough computing power. And math reasoning is just another step on a ladder of capabilities, not something that requires completely different approach). So, we'll see.

klabb3 510 days ago

> based on my belief that the brain is nothing special physics-wise and it doesn't manage to realize unknown quantum algorithms in its warm and messy environment

Agreed (well, as much as intuition goes), but current gen AI is not a brain, much less a human brain. It shows similarities, in particular emerging multi-modal pattern matching capabilities. There is nothing that says that’s all the neocortex does; in fact, the opposite is a known truth in neuroscience. We just don’t know all of its functions yet - we can’t just ignore the massive Chesterton’s fence we don’t understand.

This isn’t even necessarily because the brain is more sophisticated than anything else. We don’t have models for the weather, the immune system, or anything chaotic really. Look, folding proteins is still a research problem, and that’s at the level of known molecular structure. We greatly overestimate our ability to model & simulate things. Today’s AI is a prime example of our wishful thinking and glossing over "details".

> so that classical computers can reproduce all of its feats when having appropriate algorithms and enough computing power.

Sure. That’s a reasonable hypothesis.

> And math reasoning is just another step on a ladder of capabilities, not something that requires completely different approach

You seem to be assuming "ability" is a single axis. It’s like assuming that if we get 256-bit registers computers will start making coffee, or that going to the gym will eventually give you wings. There is nothing that suggests this. In fact, if you look at emerging ability in pattern matching that improved enormously, while seeing reasoning on novel problems sitting basically still, that suggests strongly that we are looking at a multi-axis problem domain.

red75prime 497 days ago

> if you look at emerging ability in pattern matching that improved enormously, while seeing reasoning on novel problems sitting basically still

About two years ago I came to the opinion that autoregressive models of reasonable size will not be able to capture the fullness of human abilities (mostly due to a limited compute per token). So it's not a surprise to me. But training based on reinforcement learning might be able to overcome this.

I don't believe that some specialized mechanisms are required to do math.

eksu 510 days ago

This risk could be mitigated by publishing the test.