Hacker News
by crackrook 530 days ago
I have no clue if AGI will look anything like today's LLMs, but I don't think the information we have about o3 so far suggests that it's particularly earth-shaking or even a significant step towards AGI.

From the ARC announcement: "a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval." If I understand this correctly, o3's performance is not a grand leap beyond the capabilities of models that are many times cheaper and have similarly privileged information. The ARC news seems more likely to be evidence that the benchmark needs tweaking than proof that scaling works (although OpenAI's marketing team would very much like us to interpret it as the latter).

There has also been a bit of imprecision and hand-waving around other benchmarks, which bolsters my skepticism. For instance, the Codeforces benchmark results were touted with no meaningful description of the methodology, and what little we do know suggests (to me, at least) that comparing o3's Elo to that of a human is an apples-to-oranges comparison: https://codeforces.com/blog/entry/137539

1 comment

I don't understand. If Kaggle solutions were able to do that, what the hell do these results mean?

https://arcprize.org/2024-results

No individual Kaggle solution achieved a result of 81%; rather, an ensemble of models did: https://x.com/fchollet/status/1865865271728390515

In my (possibly flawed) interpretation: o3's scores appear to be an achievement because they were attained by a single model, but the benchmark itself needs refinement before it can claim to be the measure of AGI it set out to be, since one can brute-force their way to similar results.
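
To make the ensemble point concrete, here is a toy sketch in Python of how pooling weak solvers can beat every individual one. Everything in it is made up for illustration: the task IDs, the three solvers, and which tasks each one solves are hypothetical, and none of it is ARC or Kaggle data. It also assumes the most generous pooling rule, where a task counts as solved if any member solves it. A real ensemble has to pick which answer to submit, so this is an upper bound and not how the Kaggle ensemble was actually built.

```python
from typing import Dict, Set


def solve_rate(solved: Set[str], tasks: Set[str]) -> float:
    """Fraction of the benchmark's tasks that appear in `solved`."""
    return len(solved & tasks) / len(tasks)


def ensemble_upper_bound(per_solver: Dict[str, Set[str]]) -> Set[str]:
    """Union of tasks solved by at least one member (optimistic upper bound)."""
    pooled: Set[str] = set()
    for solved in per_solver.values():
        pooled |= solved
    return pooled


# Hypothetical benchmark of 10 tasks and three hypothetical cheap solvers.
TASKS = {f"t{i}" for i in range(10)}
SOLVERS = {
    "solver_a": {"t0", "t1", "t2", "t3"},
    "solver_b": {"t2", "t3", "t4", "t5", "t6"},
    "solver_c": {"t6", "t7", "t8"},
}

best_single = max(solve_rate(s, TASKS) for s in SOLVERS.values())
ensemble = solve_rate(ensemble_upper_bound(SOLVERS), TASKS)

print(f"best single solver: {best_single:.0%}")
print(f"ensemble upper bound: {ensemble:.0%}")
```

In this toy setup no single solver gets past half the tasks, yet the pooled set covers nearly all of them, because the solvers fail on different tasks. That is the sense in which an ensemble score says more about coverage across many cheap attempts than about any one model.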