by bogtog
346 days ago
I wonder how much of the slow progress on ARC can be explained by the tasks' visual properties making them easy for humans but hard for LLMs. My impression is that models are pretty bad at interpreting grids of characters. Yesterday, I was trying to get Claude to convert a message into a cipher where it laid out a 98-character string in a 7x14 grid, with sequential letters moving 2 right and 1 down (i.e., like a knight in chess). Claude seriously struggled.

Yet, Francois always pumps up the "fluid intelligence" component of this test and emphasizes how easy these tasks are for humans. Humans would presumably be terrible at the tasks too if they had to look at them character by character.

This feels like a somewhat similar (intuition-lie?) case to the Apple paper showing that reasoning models can't do Tower of Hanoi with 10+ disks. Readers will intuitively think about how they themselves could tediously do an arbitrarily long Tower of Hanoi, which is what the paper is trying to allude to. However, the more appropriate analogy would be writing out all >1000 moves on a piece of paper at once and being 100% correct, which is obviously much harder.
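For a sense of scale on the Tower of Hanoi point: the optimal solution for n disks takes 2^n - 1 moves, so 10 disks already means 1023 moves that all have to be written out without a single slip. A minimal Python sketch of that, where the function name and peg labels are purely illustrative and not taken from the Apple paper:

```python
def hanoi_moves(n, src="A", dst="C", via="B"):
    """Yield (disk, from_peg, to_peg) for the optimal n-disk solution."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, src, via, dst)
    # Move the largest disk to its destination.
    yield (n, src, dst)
    # Move the n-1 smaller disks back on top of it.
    yield from hanoi_moves(n - 1, via, dst, src)


moves = list(hanoi_moves(10))
print(len(moves))  # 2**10 - 1 = 1023
```

The recursion itself is three lines; the hard part for a model (or a person with pen and paper) is emitting the full list in one go with zero errors.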
|
I've seen a simple ARC-AGI test that took the open set and upscaled every image in it 2x, so that every pixel became a 2x2 block of pixels.
If LLMs were bottlenecked solely by reasoning or logic capabilities, this wouldn't change their performance much, because the underlying solution barely changes.
Instead, performance dropped sharply, which hints that perception is the bottleneck.
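
To make the transformation concrete, here is a rough sketch of that kind of upscaling, assuming the grid is stored as a list of rows of integer color codes. The `upscale` helper is hypothetical and not the code used in the test mentioned above:

```python
def upscale(grid, factor=2):
    """Replace every cell with a factor x factor block of the same value."""
    return [
        # Repeat each cell `factor` times horizontally...
        [cell for cell in row for _ in range(factor)]
        for row in grid
        # ...and repeat each widened row `factor` times vertically.
        for _ in range(factor)
    ]


small = [[1, 0],
         [0, 2]]
for row in upscale(small):
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [0, 0, 2, 2]
# [0, 0, 2, 2]
```

The rule that solves the puzzle is the same before and after; only the number of cells the model has to read goes up by 4x.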