| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by wokwokwok 683 days ago

I think the parent post is complaining that insufficient acknowledgement is given to how good o1 is, because in their contrived testing, it seems better than previous models.

I don’t think that’s true though, it’s hard to be more fair and explicit than:

> OpenAI o1-preview and o1-mini both outperform GPT-4o on the ARC-AGI public evaluation dataset. o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

Ie. it’s just not that great, and it’s enormously slow.

That probably wasn’t what people wanted to hear, even if it is literally what the results show.

You cant run away from the numbers:

> It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.

(Side note: readers may be getting confused about what “test-time scaling” is, and why that’s important. TLDR: more compute is getting better results at inference time. That’s a big deal, because previously, pouring more compute at inference time didn’t seem to make much real difference; but overall I don’t see how anything you’ve said is either inaccurate or misleading)

4 comments

mikeknoop 683 days ago

I personally am slightly surprised at o1's modest performance on ARC-AGI given the large leaps in performance on other objectively hard benchmarks like IOI and AIME.

Curiosity is the first step towards new ideas.

ARC Prize's whole goal is to inspire curiosity like this and to encourage more AI researchers to explore and openly share new approaches towards AGI.

link

HeatrayEnjoyer 683 days ago

What does minutes and hours even mean? Software comparison using absolute time duration is meaningless without a description of the system it was executed on; e.g. SHA256 hashes per second on a Win10 OS and i7-14100 processor. For a product as complex as multiuser TB-sized LLMs, compute time is dependent on everything from the VM software stack to the physical networking and memory caching architecture.

CPU/GPU cycles, FLOPs, IOPs, or even joules would be superior measurements.

link

achierius 683 days ago

These are API calls to a remote server. We don't have the option of scaling up or even measuring the compute they use to run them, so for better or worse the server cluster has to be measured as part of their model service offering.

link

HeatrayEnjoyer 681 days ago

I understand that, but that's only useful if you're only looking at it from a shallow business perspective.

link

tsimionescu 683 days ago

You're right about local software comparisons, but this is different. If I'm comparing two SaaS platforms, wall clock time to achieve a similar task is a fair metric to use. The only caveat is if the service offers some kind of tiered performance pricing, like if we were comapring a task performed on an AWS EC2 instance vs Azure VM instance, but that is not the case with these LLMs.

So yes, it may be that the wall clock time is not reflective of the performance of the model, but it is reflective of the performance of the SaaS offerings.

link

dr_dshiv 683 days ago

I mean, “scaling compute at inference” actually means “using an LLM agent system.” Haven’t we known that chains of agents can be useful for a while?

link

killthebuddha 683 days ago

I agree with basically everything you said but I think you've misunderstood my point. I'll reply to the other comment with more.

link