| HN Mirror


by bogtog 346 days ago

I wonder how much of the slow progress on ARC can be explained by the tasks' visual properties making them easy for humans but hard for LLMs.

My impression is that models are pretty bad at interpreting grids of characters. Yesterday, I was trying to get Claude to turn a message into a cipher by laying out a 98-character string in a 7x14 grid where sequential letters moved 2 right and 1 down (i.e., like a knight in chess). Claude seriously struggled.

Yet Francois always pumps up the "fluid intelligence" component of this test and emphasizes how easy these tasks are for humans. Humans would presumably be terrible at the tasks too if they had to look at them character by character.

This feels like a somewhat similar (intuition-lie?) case to the Apple paper showing that reasoning models can't do Tower of Hanoi past about 10 disks. Readers will intuitively think about how they themselves could tediously work through an arbitrarily long Tower of Hanoi, which is what the paper is trying to allude to. However, the more appropriate analogy would be writing out all >1000 moves on a piece of paper at once and being 100% correct, which is obviously much harder.
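For a sense of where the ">1000 moves" figure comes from: the shortest solution for n disks takes 2^n - 1 moves, so 10 disks already means 1023 moves. A minimal sketch of the standard recursive solution, just to count moves (the function name and peg labels here are my own, not anything from the paper):

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the full list of (disk, from_peg, to_peg) moves for n disks."""
    if n == 0:
        return []
    # Move the n-1 smaller disks out of the way, move disk n, then restack.
    return (
        hanoi_moves(n - 1, source, spare, target)
        + [(n, source, target)]
        + hanoi_moves(n - 1, spare, target, source)
    )


for disks in (3, 10, 15):
    print(disks, len(hanoi_moves(disks)))
# prints:
# 3 7
# 10 1023
# 15 32767
```

A single wrong entry anywhere in that list breaks the whole answer, which is the "100% correct on paper" version of the task.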

2 comments

ACCount36 345 days ago

There are some major hints that this is indeed the case.

I've seen a simple ARC-AGI test that took the open set, and doubled every image in it. Every pixel became a 2x2 block of pixels.

If LLMs were bottlenecked solely by reasoning or logic capabilities, this wouldn't change their performance all that much, because the solution doesn't change all that much.

Instead, the performance dropped sharply - which hints that perception is the bottleneck.
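For anyone who wants to try it, the transformation itself is trivial. This is a rough sketch of what "doubling" could look like, assuming a grid is a list of rows of integer color codes; `upscale` is a name I made up, and this is not the code from the test I saw:

```python
def upscale(grid, factor=2):
    """Repeat every cell `factor` times horizontally and vertically."""
    out = []
    for row in grid:
        # Stretch the row horizontally: each cell becomes `factor` cells.
        wide = [cell for cell in row for _ in range(factor)]
        # Stretch vertically: emit `factor` independent copies of the row.
        out.extend(list(wide) for _ in range(factor))
    return out


small = [[1, 2],
         [3, 4]]
for row in upscale(small):
    print(row)
# prints:
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```

The rule needed to solve a task is unchanged by this; only the amount of grid the model has to read goes up (4x the cells at factor 2).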


krackers 345 days ago

I thought so too back when the test was first released, but now that we have multimodal models which can take images directly as input, shouldn't this point be moot?


bogtog 345 days ago

Afaik the top performer (ChatGPT o3) is still treating ARC as a series of characters. I imagine complex reasoning in multimodal processing isn't nearly as advanced, so treating it as characters is still better.


krackers 345 days ago

Interesting, I thought one of the whole points of o3 was mixed multimodal reasoning (e.g. everyone doing those GeoGuessr challenges). But maybe that's just a parlor trick and it's not actually implemented that way. I wonder when they're going to extend chain-of-thought to work with image tokens; it seems like that would help with spatial challenges like this.


bogtog 343 days ago

I can't speak to whether it is a parlor trick, but my gut is that processing a 30x30 grid isn't really representative of o3's image processing. This tiny grid isn't like any image it would encounter normally, and it is so short as text that the benefits of language processing outweigh the downsides.

I expect that for much larger images (e.g., 300x300 grids) and for problems simpler than ARC, o3's image processing would give it a lead over o3 processing a very long character stream.
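To put rough numbers on "very long character stream", here is a back-of-the-envelope sketch. It assumes one character per cell and a newline between rows, which is only one possible serialization; real token counts depend on the tokenizer and the prompt format, so treat these as character counts only. The function names are mine.

```python
def grid_to_text(grid):
    """Serialize a grid as rows of digits separated by newlines."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)


def text_length(side):
    """Character count of a side x side grid of single-digit cells."""
    grid = [[0] * side for _ in range(side)]
    return len(grid_to_text(grid))


for side in (30, 300):
    print(side, text_length(side))
# prints:
# 30 929
# 300 90299
```

So a 10x increase in side length means roughly 100x the characters, which is where I'd expect the text route to start losing.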


ACCount36 345 days ago

Even the very best multimodal LLMs still suffer from a harsh perception bottleneck. They're impressive, but nowhere near as good as the human visual cortex.
