Hacker News
by p1esk 551 days ago
Are these benchmarks still meaningful?

No, and they haven't been for at least half a year. Utterly optimized for by the providers. Nowadays, if a model were SotA for general use but not #1 on any of these benchmarks, I doubt they'd even release it.
I've started keeping an eye out for original brainteasers, just for that reason. GCHQ's Christmas puzzle just came out [1], and o1-pro got 6 out of 7 of them right. It took about 20 minutes in total.

I wasn't going to bother trying those because I was pretty sure it wouldn't get any of them, but decided to give it an easy one (#4) and was impressed at the CoT.

Meanwhile, Google's newest 2.0 Flash model went 0 for 7.

1: https://metro.co.uk/2024/12/11/gchq-christmas-puzzle-2024-re...

Why are you comparing Flash vs o1-pro? Wouldn't a fairer comparison be Flash vs mini?
I just asked o1-mini the first two questions and it got them wrong.
It's the only Google model that my account has access to that accepts .PNG files. I assumed it was the latest/greatest experimental 2.0 release.

If they want a rematch, they'll need to bring their 'A' game next time, because o1-pro is crazy good.

Did it get #8 right? The linked article provides the wrong answer btw.
I didn't see a straightforward way to submit the final problem, because I used different contexts for each of the 7 subproblems.

Given the right prompt, though, I'm sure it could handle the 'find the corresponding letter from the landmarks to form an anagram' part. That's easier than most of the other problems.

You're saying the ultimate answer isn't 'PROTECTING THE UNITED KINGDOM'?

If you follow the sleigh's Morse path starting from the robin, it will be 'united in protecting the kingdom'.
Wow! That’s all I need to know about Google’s model.
What is impressive about this new model is that it is the lightweight version (flash).

There will probably be a 2.0 pro (which will be 4o/sonnet class) and maybe an ultra (o1(?)/Opus).

That's a comparison of multiple GPT-4 models working together... against a single GPT-4 mini style model.
> multiple GPT-4 models working together

What do you mean? Is o1 not a single model?