by ottaborra 531 days ago
I'm probably sounding like a broken record, but given how o3 cracked the arc bench, this isn't as far-fetched as some of you may think. ML models will very likely continue to scale regardless of how many bets are placed against them. I'm not sure why so few people are concerned about arc bench being cracked so fast. Our grand delusions of specialness have been shown to be just that: delusions.

"Humanity is just a small step in the giant staircase of intelligence" - Geoffrey Hinton

2 comments

I have no clue if AGI will look anything like today's LLMs, but I don't think the information we have about o3 so far suggests that it's particularly earth-shaking or even a significant step towards AGI.

From the ARC announcement: "a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval." If I understand this correctly, o3's performance is not a grand leap beyond what far cheaper models can do with similarly privileged information. The ARC news seems more likely to be evidence that the benchmark needs tweaking than proof that scaling works (although OpenAI's marketing team would very much like us to interpret it as the latter).

There has also been a bit of imprecision and hand-waving around other benchmarks that bolsters my skepticism. For instance, the Codeforces benchmark results were touted with no meaningful description of the methodology, and what little we do know suggests (to me, at least) that comparing o3's Elo to that of a human is an apples-to-oranges comparison: https://codeforces.com/blog/entry/137539

I don't understand. If Kaggle solutions were able to do that, what the hell do these results mean?

https://arcprize.org/2024-results

No individual Kaggle solution achieved a result of 81%; rather, an ensemble of models did: https://x.com/fchollet/status/1865865271728390515

In my (possibly flawed) interpretation: o3's scores appear to be an achievement because they were attained by a single model, but the benchmark itself needs refinement before it can claim to be the measure of AGI it set out to be, since one can brute-force one's way to similar results.

What's arc bench?