by killthebuddha 636 days ago
In my opinion this blog post is a little bit misleading about the difference between o1 and earlier models. When I first heard about ARC-AGI (a few months ago, I think) I took a few of the ARC tasks and spent a few hours testing all the most powerful models. I was kind of surprised by how completely the models fell on their faces, even with heavy-handed feedback and various prompting techniques. None of the models came close to solving even the easiest puzzles. So today I tried again with o1-preview, and the model solved (probably the easiest) puzzle without any kind of fancy prompting:

https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...

Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.

4 comments

So it gives you the wrong answer and then you keep telling it how to fix it until it does? What does fancy prompting look like then, just feeding it the solution piece by piece?
Basically yes, but there's a very wide range of how explicit the feedback could be. Here's an example where I tell gpt-4 exactly what the rule is and it still fails:

https://chatgpt.com/share/66e514d3-ca0c-8011-8d1e-43234391a0...

and an example using gpt-4o:

https://chatgpt.com/share/66e515da-a848-8011-987f-71dab56446...

I'd share similar examples using claude-3.5-sonnet but I can't figure out how to do it from the claude.ai UI.

To be clear, my point is not at all that o1 is so incredibly smart. IMO the ARC-AGI puzzles show very clearly how dumb even the most advanced models are. My point is just that o1 does seem to be noticeably better at solving these problems than previous models.

The easiest way I know of to share Claude chats is by using this Chrome extension to create a GitHub gist:

https://chromewebstore.google.com/detail/claudesave/bmdnfhji...

It's not perfect, but works fine for chats that don't have tables.

> where I tell gpt-4 exactly what the rule is and it still fails

It figured out the rule itself. It has problems applying the rule.

In this example btw, asking it to write a program will solve the problem.

All examples are 404'd for me.
Hmm. My first thought was that I shared non-public links, but I double-checked I can access them from another machine.
FYI They load fine for me.
Yeah seems to just be an issue with my Firefox configuration -- works fine on Edge.
The pages fail to load on old web browsers.
Author here. Which aspects are misleading? How can it be improved?
I think the post is great, clear and fair and all that. And I definitely agree with the general point that o1 shows some amount of improvement on generality but with a massive tradeoff on cost.

I'm going to think through what I find "misleading" as I write this...

Ok so I guess it's that I wouldn't be surprised at all if we learn that models can improve a ton w.r.t. human-in-the-loop prompt engineering (e.g. ChatGPT) without a commensurate improvement in programmatic prompt engineering.

It's very difficult to get a Python-driven claude-3.5-sonnet agent to solve ARC tasks and it's also very difficult to get claude-3.5-sonnet to solve ARC tasks via the claude.ai UI. The blog post shows that it's also very difficult to get a Python-driven o1-preview agent to solve ARC tasks. From a cursory exploration of o1-preview's capabilities in the ChatGPT UI my intuition is that it's significantly smarter than claude-3.5-sonnet based on how much better it responds to my human-in-the-loop feedback.

So I guess my point is that many people will probably come away from the blog post thinking "there's nothing to see here", o1-preview is more of the same thing, but it seems to me that it's very clearly qualitatively different than previous models.

Aside: This isn't a problem with the blog post at all IMO, we don't need to litter every benchmark post with a million caveats/exceptions/disclaimers/etc.

I think the parent post is complaining that insufficient acknowledgement is given to how good o1 is, because in their contrived testing, it seems better than previous models.

I don’t think that’s true though, it’s hard to be more fair and explicit than:

> OpenAI o1-preview and o1-mini both outperform GPT-4o on the ARC-AGI public evaluation dataset. o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

I.e., it’s just not that great, and it’s enormously slow.

That probably wasn’t what people wanted to hear, even if it is literally what the results show.

You can’t run away from the numbers:

> It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.

(Side note: readers may be getting confused about what “test-time scaling” is, and why that’s important. TLDR: more compute is getting better results at inference time. That’s a big deal, because previously, pouring more compute at inference time didn’t seem to make much real difference; but overall I don’t see how anything you’ve said is either inaccurate or misleading)
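
For a sense of scale, here is the per-task arithmetic on the two totals quoted above. This is only a rough sketch: it assumes the totals are wall-clock time spread evenly over the 400 public tasks, which the post does not state, and the variable names are mine.

```python
# Totals as quoted from the blog post; the even split per task is an assumption.
TASKS = 400
O1_TOTAL_S = 70 * 3600        # 70 hours, in seconds
BASELINE_TOTAL_S = 30 * 60    # 30 minutes, in seconds


def per_task_seconds(total_seconds: float, tasks: int = TASKS) -> float:
    """Average seconds per task, assuming time is spread evenly."""
    return total_seconds / tasks


o1_per_task = per_task_seconds(O1_TOTAL_S)               # 630.0 s, i.e. 10.5 minutes
baseline_per_task = per_task_seconds(BASELINE_TOTAL_S)   # 4.5 s
ratio = O1_TOTAL_S / BASELINE_TOTAL_S                    # 140.0

print(f"o1-preview: {o1_per_task:.1f} s/task")
print(f"GPT-4o / Claude 3.5 Sonnet: {baseline_per_task:.1f} s/task")
print(f"ratio of the two totals: {ratio:.0f}x")
```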

I personally am slightly surprised at o1's modest performance on ARC-AGI given the large leaps in performance on other objectively hard benchmarks like IOI and AIME.

Curiosity is the first step towards new ideas.

ARC Prize's whole goal is to inspire curiosity like this and to encourage more AI researchers to explore and openly share new approaches towards AGI.

What do minutes and hours even mean? Software comparison using absolute time duration is meaningless without a description of the system it was executed on; e.g. SHA256 hashes per second on a Win10 OS and i7-14100 processor. For a product as complex as multiuser TB-sized LLMs, compute time is dependent on everything from the VM software stack to the physical networking and memory caching architecture.

CPU/GPU cycles, FLOPs, IOPs, or even joules would be superior measurements.

These are API calls to a remote server. We don't have the option of scaling up or even measuring the compute they use to run them, so for better or worse the server cluster has to be measured as part of their model service offering.
I understand that, but that's only useful if you're only looking at it from a shallow business perspective.
You're right about local software comparisons, but this is different. If I'm comparing two SaaS platforms, wall clock time to achieve a similar task is a fair metric to use. The only caveat is if the service offers some kind of tiered performance pricing, like if we were comparing a task performed on an AWS EC2 instance vs Azure VM instance, but that is not the case with these LLMs.

So yes, it may be that the wall clock time is not reflective of the performance of the model, but it is reflective of the performance of the SaaS offerings.
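
To make "wall clock time of the SaaS offering" concrete, here is a minimal sketch of how one might measure it from the client side. `fake_service` is a hypothetical stand-in for a real API call, not any vendor's client; the measured time covers everything behind the call (network, queueing, the model itself), which is exactly the point being argued.

```python
import time
from typing import Callable, Tuple


def time_call(fn: Callable[[], str]) -> Tuple[str, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def fake_service() -> str:
    # Hypothetical stand-in for a remote model call; swap in a real client here.
    time.sleep(0.05)
    return "answer"


result, elapsed = time_call(fake_service)
print(f"{result!r} took {elapsed:.3f}s end to end")
```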

I mean, “scaling compute at inference” actually means “using an LLM agent system.” Haven’t we known that chains of agents can be useful for a while?
I agree with basically everything you said but I think you've misunderstood my point. I'll reply to the other comment with more.
Both GPT-4o and Claude 3.5 can trivially solve this puzzle if you direct them to solve it with program synthesis (that is, have them write a program that solves it, e.g. https://pastebin.com/wDTWYcSx).

Without program synthesis (the way you are doing it), the LLM inevitably fails to change the correct positions (bad counting and whatnot).
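
For anyone unfamiliar with the idea, here is a minimal sketch of the checking step that makes program synthesis attractive: a candidate program is only trusted if it reproduces every example pair, and only then is it run on the test input. Everything here is illustrative. The `candidate` function (mirror each row) and the toy grids are hypothetical and are not the puzzle from this thread or the program in the pastebin.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]


def fits_examples(program: Callable[[Grid], Grid], examples: List[Pair]) -> bool:
    """True only if the program reproduces every example output exactly."""
    return all(program(inp) == out for inp, out in examples)


def candidate(grid: Grid) -> Grid:
    # Hypothetical stand-in for model-written code: mirror each row.
    return [list(reversed(row)) for row in grid]


# Toy example pairs (made up for illustration).
examples: List[Pair] = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[0, 2, 0]], [[0, 2, 0]]),
]
test_input: Grid = [[3, 0, 0]]

# Only apply the program to the test input if it explains all the examples.
prediction: Optional[Grid] = (
    candidate(test_input) if fits_examples(candidate, examples) else None
)
print(prediction)
```

The code does the counting and indexing deterministically, so the model only has to get the rule right, not apply it cell by cell.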

And what prompt did you give them to generate the program? Did you tell them explicitly that they need to fill the cornered cells? If so, that is not what the benchmark is about. The benchmark asks the LLM to figure out the pattern itself.

I gave the task to Claude and asked it to write Python code, and it failed to recognize the pattern:

> To solve this puzzle, we need to implement a program that follows the pattern observed in the given examples. It appears that the rule is to replace 'O' with 'X' when it's adjacent (horizontally, vertically, or diagonally) to exactly two '@' symbols. Let's write a Python program to solve this:
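
Claude's own program isn't reproduced here. As a minimal sketch of the rule as it stated it (which, per the above, is not the actual pattern), the code below replaces an 'O' with 'X' when exactly two of its eight neighbours are '@'. The function name and the sample grid are made up for illustration.

```python
from typing import List


def apply_stated_rule(grid: List[str]) -> List[str]:
    """Replace 'O' with 'X' where exactly two of the 8 neighbours are '@'."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            cell = grid[r][c]
            if cell == "O":
                # Count '@' neighbours in the original grid, skipping
                # the cell itself and anything outside the grid.
                neighbours = sum(
                    grid[r + dr][c + dc] == "@"
                    for dr in (-1, 0, 1)
                    for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0)
                    and 0 <= r + dr < rows
                    and 0 <= c + dc < cols
                )
                if neighbours == 2:
                    cell = "X"
            new_row.append(cell)
        out.append("".join(new_row))
    return out


# Made-up sample grid, not one of the ARC tasks.
sample = ["@O@", "OOO", "OO@"]
print(apply_stated_rule(sample))
```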

arc reasoning challenge. I'm going to give you 2 example input/output pairs and then a third bare input. Please produce the correct third output.

It used its CoT to understand cornering -- then I got it to write a program.

But as I try again, it's not reliable.

> But as I try again, it's not reliable.

This is why I will never try anything like this on a remote server I don't control. All my toy experiments are with local LLMs that I can make sure are the same ones day after day.

An interesting part, if you check the CoT output, is how it solved the task: it said the pattern is to make the number of filled cells in each row even, with a neat layout. That is an interesting side effect, but not what the task was about.

It also refers to some "assistant"; it looks like they have some mysterious component in addition to the LLM, or another LLM.