| HN Mirror



	by tigershark 585 days ago
	Where is the plateau? ChatGPT 4 was ~0% on ARC-AGI. 4o was 5%. This model literally solved it with a score higher than the 85% of the average human. And let’s not forget the unbelievable 25% on FrontierMath, where even the most brilliant mathematicians in the world cannot solve a lot of the problems by themselves. We are talking about cutting-edge math research problems that are out of reach for practically everyone. You will get a rude awakening if you call this unbelievable advancement a “plateau”.


csomar 585 days ago

I don't care about benchmarks. O1 ranks higher than Claude on "benchmarks" but performs worse in particular real-life coding situations. I'll judge the model myself by how useful/correct it is for my tasks rather than by hypothetical benchmarks.


famouswaffles 585 days ago

In most non-competitive coding benchmarks (Aider, LiveBench, SWE-bench), o1 ranks worse than Sonnet (so the benchmarks aren't saying anything different), or at least it did; the new checkpoint 2 days ago finally pushed o1 over Sonnet on LiveBench.


tigershark 585 days ago

As I said, o3 demonstrated Fields Medal-level research capacity in the FrontierMath tests. But I’m sure that your use cases are much more difficult than that, obviously.


riku_iki 584 days ago

There are many comments on the internet about this, saying that only a subset of the FrontierMath benchmark is "Fields Medal-level research", and o3 likely scored on the easier subset.

Also, all of this is shady in that these are just numbers from OAI, which are not reproducible, on a benchmark sponsored by OAI. If we assume OAI could be a bad actor, they had plenty of opportunities to cheat on this.


whynotminot 585 days ago

“Objective benchmarks are useless, let’s argue about which one works better for me personally.”


csomar 585 days ago

Yes. My benchmarks and their benchmarks mean AGI. Their benchmarks only mean over-fitted.


whynotminot 585 days ago

Ok, so what if we get different results for our own personal benchmarks/use cases?

(See why objective benchmarks exist?)


bakugo 585 days ago

Yes, "objective" benchmarks can be gamed, real-life tasks cannot.


YeGoblynQueenne 585 days ago

AI benchmarks and tests that claim to measure understanding, reasoning, intelligence, and so on are a dime a dozen. Chess, Go, Atari, Jeopardy, Raven's Progressive Matrices, the Winograd Schema Challenge, Starcraft... and so on and so forth.

Or let's talk about the breakthroughs. SVMs would lead us to AGI. Then LSTMs would lead us to AGI. Then Convnets would lead us to AGI. Then DeepRL would lead us to AGI. Now Transformers will lead us to AGI.

Benchmarks fall right and left and we keep being led to AGI but we never get there. It leaves one with such a feeling of angst. Are we ever gonna get to AGI? When's Godot coming?
