
by killthebuddha 683 days ago

In my opinion this blog post is a little bit misleading about the difference between o1 and earlier models. When I first heard about ARC-AGI (a few months ago, I think) I took a few of the ARC tasks and spent a few hours testing all the most powerful models. I was kind of surprised by how completely the models fell on their faces, even with heavy-handed feedback and various prompting techniques. None of the models came close to solving even the easiest puzzles. So today I tried again with o1-preview, and the model solved a puzzle (probably the easiest one) without any kind of fancy prompting:

https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...

Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.

4 comments

kobe_bryant 683 days ago

So it gives you the wrong answer and then you keep telling it how to fix it until it does? What does fancy prompting look like then, just feeding it the solution piece by piece?

killthebuddha 683 days ago

Basically yes, but there's a very wide range of how explicit the feedback could be. Here's an example where I tell gpt-4 exactly what the rule is and it still fails:

https://chatgpt.com/share/66e514d3-ca0c-8011-8d1e-43234391a0...

and an example using gpt-4o:

https://chatgpt.com/share/66e515da-a848-8011-987f-71dab56446...

I'd share similar examples using claude-3.5-sonnet but I can't figure out how to do it from the claude.ai UI.

To be clear, my point is not at all that o1 is so incredibly smart. IMO the ARC-AGI puzzles show very clearly how dumb even the most advanced models are. My point is just that o1 does seem to be noticeably better at solving these problems than previous models.

rahimnathwani 683 days ago

The easiest way I know of to share Claude chats is by using this Chrome extension to create a GitHub gist:

https://chromewebstore.google.com/detail/claudesave/bmdnfhji...

It's not perfect, but works fine for chats that don't have tables.

usaar333 683 days ago

> where I tell gpt-4 exactly what the rule is and it still fails

It figured out the rule itself. It has problems applying the rule.

In this example btw, asking it to write a program will solve the problem.

seaal 683 days ago

All examples are 404'd for me.

killthebuddha 683 days ago

Hmm. My first thought was that I shared non-public links, but I double-checked I can access them from another machine.

stingraycharles 683 days ago

FYI They load fine for me.

seaal 681 days ago

Yeah seems to just be an issue with my Firefox configuration -- works fine on Edge.

Zr01 683 days ago

The pages fail to load on old web browsers.

mikeknoop 683 days ago

Author here. Which aspects are misleading? How can it be improved?

killthebuddha 683 days ago

I think the post is great, clear and fair and all that. And I definitely agree with the general point that o1 shows some amount of improvement on generality but with a massive tradeoff on cost.

I'm going to think through what I find "misleading" as I write this...

Ok so I guess it's that I wouldn't be surprised at all if we learn that models can improve a ton w.r.t. human-in-the-loop prompt engineering (e.g. ChatGPT) without a commensurate improvement in programmatic prompt engineering.

It's very difficult to get a Python-driven claude-3.5-sonnet agent to solve ARC tasks and it's also very difficult to get claude-3.5-sonnet to solve ARC tasks via the claude.ai UI. The blog post shows that it's also very difficult to get a Python-driven o1-preview agent to solve ARC tasks. From a cursory exploration of o1-preview's capabilities in the ChatGPT UI my intuition is that it's significantly smarter than claude-3.5-sonnet based on how much better it responds to my human-in-the-loop feedback.

So I guess my point is that many people will probably come away from the blog post thinking "there's nothing to see here, o1-preview is more of the same thing", but it seems to me that it's very clearly qualitatively different from previous models.

Aside: This isn't a problem with the blog post at all IMO, we don't need to litter every benchmark post with a million caveats/exceptions/disclaimers/etc.

wokwokwok 683 days ago

I think the parent post is complaining that insufficient acknowledgement is given to how good o1 is, because in their contrived testing, it seems better than previous models.

I don’t think that’s true though, it’s hard to be more fair and explicit than:

> OpenAI o1-preview and o1-mini both outperform GPT-4o on the ARC-AGI public evaluation dataset. o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

I.e., it’s just not that great, and it’s enormously slow.

That probably wasn’t what people wanted to hear, even if it is literally what the results show.

You can't run away from the numbers:

> It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.
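For scale, here is the per-task arithmetic implied by that quote. This is only a back-of-the-envelope sketch: it assumes the 30 minutes applies to each of GPT-4o and Claude 3.5 Sonnet separately, and that time is spread evenly across the 400 tasks. The variable names are made up for illustration.

```python
# Average wall-clock time per task, using only the totals from the quote.
TASKS = 400

o1_total_s = 70 * 3600         # 70 hours, in seconds
baseline_total_s = 30 * 60     # 30 minutes, in seconds (assumed per model)

o1_per_task_s = o1_total_s / TASKS              # average seconds per task for o1-preview
baseline_per_task_s = baseline_total_s / TASKS  # average seconds per task for the baseline
slowdown = o1_total_s / baseline_total_s        # ratio of the two totals

print(f"o1-preview: {o1_per_task_s:.1f} s/task")
print(f"baseline:   {baseline_per_task_s:.1f} s/task")
print(f"ratio:      {slowdown:.0f}x")
```

That works out to roughly ten and a half minutes per task versus a few seconds per task.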

(Side note: readers may be getting confused about what “test-time scaling” is, and why that’s important. TLDR: more compute is getting better results at inference time. That’s a big deal, because previously, pouring more compute at inference time didn’t seem to make much real difference; but overall I don’t see how anything you’ve said is either inaccurate or misleading)

mikeknoop 683 days ago

I personally am slightly surprised at o1's modest performance on ARC-AGI given the large leaps in performance on other objectively hard benchmarks like IOI and AIME.

Curiosity is the first step towards new ideas.

ARC Prize's whole goal is to inspire curiosity like this and to encourage more AI researchers to explore and openly share new approaches towards AGI.

HeatrayEnjoyer 683 days ago

What do minutes and hours even mean? Software comparison using absolute time duration is meaningless without a description of the system it was executed on; e.g. SHA256 hashes per second on a Win10 OS and i7-14100 processor. For a product as complex as multiuser TB-sized LLMs, compute time is dependent on everything from the VM software stack to the physical networking and memory caching architecture.

CPU/GPU cycles, FLOPs, IOPs, or even joules would be superior measurements.

achierius 683 days ago

These are API calls to a remote server. We don't have the option of scaling up or even measuring the compute they use to run them, so for better or worse the server cluster has to be measured as part of their model service offering.

HeatrayEnjoyer 681 days ago

I understand that, but that's only useful if you're only looking at it from a shallow business perspective.

tsimionescu 683 days ago

You're right about local software comparisons, but this is different. If I'm comparing two SaaS platforms, wall clock time to achieve a similar task is a fair metric to use. The only caveat is if the service offers some kind of tiered performance pricing, like if we were comparing a task performed on an AWS EC2 instance vs an Azure VM instance, but that is not the case with these LLMs.

So yes, it may be that the wall clock time is not reflective of the performance of the model, but it is reflective of the performance of the SaaS offerings.
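A minimal sketch of what measuring wall clock time for a hosted service could look like. The `solve` callable stands in for whatever client code sends one task to the remote API; `time_tasks` and `fake_solver` are hypothetical names, and the fake solver just sleeps so the example runs without network access.

```python
import time
from typing import Callable, Dict, Sequence


def time_tasks(solve: Callable[[str], str], tasks: Sequence[str]) -> Dict[str, float]:
    """Run solve() on each task and record end-to-end wall-clock time."""
    durations = []
    for task in tasks:
        start = time.perf_counter()   # monotonic, high-resolution clock
        solve(task)                   # includes network and queueing time for a real API
        durations.append(time.perf_counter() - start)
    total = sum(durations)
    mean = total / len(durations) if durations else 0.0
    return {"n": len(durations), "total_s": total, "mean_s": mean}


def fake_solver(task: str) -> str:
    """Stand-in for a remote API call (assumption: a real client would go here)."""
    time.sleep(0.01)
    return task.upper()


stats = time_tasks(fake_solver, ["task-a", "task-b", "task-c"])
print(stats)
```

Everything between the two `perf_counter()` calls is counted, which is the point: the number describes the service as a whole, not the model in isolation.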

dr_dshiv 683 days ago

I mean, “scaling compute at inference” actually means “using an LLM agent system.” Haven’t we known that chains of agents can be useful for a while?

killthebuddha 683 days ago

I agree with basically everything you said but I think you've misunderstood my point. I'll reply to the other comment with more.

usaar333 683 days ago

Both ChatGPT 4o and Claude 3.5 can trivially solve this puzzle if you direct them to do program synthesis to solve it (that is, write a program that solves it, e.g. https://pastebin.com/wDTWYcSx).

Without program synthesis (the way you are doing it), the LLM inevitably fails to change the correct position (bad counting and whatnot).
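To make "program synthesis" concrete, here is a minimal sketch of the checking half of the idea: a candidate program is accepted only if it reproduces every example output exactly. The grids and candidate functions below are toy stand-ins invented for illustration, not the puzzle from this thread; in practice the candidates would be programs written by the LLM.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]
Program = Callable[[Grid], Grid]


def fits_all(program: Program, examples: Sequence[Pair]) -> bool:
    """True if the program maps every example input to its expected output."""
    return all(program(inp) == out for inp, out in examples)


def first_fitting(candidates: Sequence[Tuple[str, Program]],
                  examples: Sequence[Pair]) -> Optional[str]:
    """Return the name of the first candidate consistent with all examples."""
    for name, program in candidates:
        if fits_all(program, examples):
            return name
    return None


# Hypothetical candidate programs (an LLM would generate these).
def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


def mirror(g: Grid) -> Grid:
    return [row[::-1] for row in g]


# Toy example pairs: each output is the input flipped left-to-right.
examples = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 4, 0]], [[0, 4, 3]]),
]
candidates = [("identity", identity), ("transpose", transpose), ("mirror", mirror)]

chosen = first_fitting(candidates, examples)
print(chosen)
```

The exact-match check is what removes the counting mistakes: the program either reproduces the grid cell for cell or it is rejected.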

riku_iki 683 days ago

And what prompt did you give them to generate the program? Did you tell them explicitly that they need to fill cornered cells? If yes, that is not what the benchmark is about. The benchmark asks the LLM to figure out what the pattern is.

I gave the task to Claude and asked it to write Python code, and it failed to recognize the pattern:

To solve this puzzle, we need to implement a program that follows the pattern observed in the given examples. It appears that the rule is to replace 'O' with 'X' when it's adjacent (horizontally, vertically, or diagonally) to exactly two '@' symbols. Let's write a Python program to solve this:
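(Claude's actual program isn't reproduced here. For reference, a minimal sketch of the rule as it stated it might look like the following; `apply_rule` and the sample grid are my own illustration, and this is the rule Claude guessed, not the pattern the task is actually about.)

```python
from typing import List


def apply_rule(grid: List[str]) -> List[str]:
    """Replace 'O' with 'X' when exactly two of its 8 neighbours are '@'."""
    rows = len(grid)
    cols = len(grid[0]) if grid else 0
    out = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            cell = grid[r][c]
            if cell == "O":
                # Count '@' in the surrounding cells, staying inside the grid.
                # Counts are read from the original grid, not the output.
                count = sum(
                    1
                    for dr in (-1, 0, 1)
                    for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0)
                    and 0 <= r + dr < rows
                    and 0 <= c + dc < cols
                    and grid[r + dr][c + dc] == "@"
                )
                if count == 2:
                    cell = "X"
            new_row.append(cell)
        out.append("".join(new_row))
    return out


sample = ["@O@", "OOO", "OO@"]  # made-up grid, not from the ARC task
result = apply_rule(sample)
print("\n".join(result))
```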

usaar333 683 days ago

arc reasoning challenge. I'm going to give you 2 example input/output pairs and then a third bare input. Please produce the correct third output.

It used its CoT to understand cornering -- then I got it to write a program.

But as I try again, it's not reliable.

exe34 683 days ago

> But as I try again, it's not reliable.

This is why I will never try anything like this on a remote server I don't control. All my toy experiments are with local LLMs that I can make sure are the same ones day after day.

riku_iki 683 days ago

The interesting part, if you check the CoT output, is the way it solved it: it said the pattern is to make the number of filled cells in each row even, with a neat layout. That is an interesting side effect, but not what the task was about.

It also refers to some "assistant"; it looks like they have some mysterious component in addition to the LLM, or another LLM.