| HN Mirror


	by p1esk 551 days ago
	Are these benchmarks still meaningful?

2 comments

maeil 551 days ago

No, and they haven't been for at least half a year. The providers optimize heavily for them. Nowadays, if a model were SotA for general use but not #1 on any of these benchmarks, I doubt they'd even release it.

CamperBob2 551 days ago

I've started keeping an eye out for original brainteasers, just for that reason. GCHQ's Christmas puzzle just came out [1], and o1-pro got 6 out of 7 of them right. It took about 20 minutes in total.

I wasn't going to bother trying those because I was pretty sure it wouldn't get any of them, but I decided to give it an easy one (#4) and was impressed by the CoT.

Meanwhile, Google's newest 2.0 Flash model went 0 for 7.

1: https://metro.co.uk/2024/12/11/gchq-christmas-puzzle-2024-re...

iamdelirium 550 days ago

Why are you comparing Flash with o1-pro? Wouldn't a fairer comparison be Flash vs. mini?

iamdelirium 550 days ago

I just asked o1-mini the first two questions and it got them wrong.

CamperBob2 550 days ago

It's the only Google model that my account has access to that accepts .PNG files. I assumed it was the latest/greatest experimental 2.0 release.

If they want a rematch, they'll need to bring their 'A' game next time, because o1-pro is crazy good.

nrvn 550 days ago

Did it get the 8th one right? The linked article gives the wrong answer, btw.

CamperBob2 550 days ago

I didn't see a straightforward way to submit the final problem, because I used different contexts for each of the 7 subproblems.

Given the right prompt, though, I'm sure it could handle the 'find the corresponding letter from the landmarks to form an anagram' part. That's easier than most of the other problems.

You're saying the ultimate answer isn't 'PROTECTING THE UNITED KINGDOM'?

nrvn 550 days ago

If you follow the sleigh's Morse path starting from the robin, it will be 'united in protecting the kingdom'.

p1esk 550 days ago

Wow! That’s all I need to know about Google’s model.

Workaccount2 550 days ago

What is impressive about this new model is that it is the lightweight version (flash).

There will probably be a 2.0 Pro (which will be 4o/Sonnet class) and maybe an Ultra (o1(?)/Opus class).

danpalmer 550 days ago

That's a comparison of multiple GPT-4 models working together... against a single GPT-4 mini style model.

p1esk 550 days ago

> multiple GPT-4 models working together

What do you mean? Is o1 not a single model?
