
by threepts 60 days ago

Why don't they ask their premier model to generate a bench for them?

Jokes aside, a benchmark I look forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning heavy.

Leaderboard: https://arcprize.org/leaderboard

(Most premier models don't even pass 5 percent.)


falcor84 60 days ago

They focus on minimizing the number of moves and don't allow any harness whatsoever, putting the bar extremely high. The current top verified contender (Claude Opus 4.6) is at only 0.45%. But with how new it is, I expect a lot of improvement in the next generation of models.

threepts 60 days ago

Optimal for judging actual reasoning ability rather than an LLM's ability to regurgitate knowledge from a necropost on HN/Reddit/Twitter from 2018.

knollimar 60 days ago

A small harness that stores text files and manages context could be useful; otherwise you lose all ability to measure that skill (and that's important because it represents real-world use cases on large code bases).

anthonypasq 59 days ago

ARC-AGI isn't testing a model's ability to store files and code things. It's testing its ability to reason through puzzles given the same information as a human.

falcor84 56 days ago

But that's the thing, as a human faced with a problem I'd often say "Sure, just let me get a pen, some paper and a calculator". Why shouldn't we make it easy for AIs to use their tools of choice?

knollimar 58 days ago

if you tested my ability to reason and you gave me some challenging problems that involved arithmetic, it might be a better test if you gave me a scratch pad so I don't mess up the reasoning parts by failing arithmetic.

jjmarr 60 days ago

I'm making an LLM agent that can play DS games. The biggest blocker is clicking on the right spot to move things around in space rather than reasoning abilities.

ARC-AGI seems to test that as well. Every game is a rectangular grid to make it as easy as possible, yet the AIs still fail.

I'm fairly certain the way forward isn't through agents directly interfacing with UIs but through agents using scripts and other tools to interact with the interface. That's why harnesses are so critical to performance on tasks like this.

I would like a version of ARC-AGI that tests the agent's ability to dynamically create these harnesses.

anthonypasq96 60 days ago

The whole point of ARC-AGI-3 is that if models are AGI, then they should be able to solve the same tasks humans do given the same information, but they can't. Allowing scripts and harnesses and whatnot completely defeats the purpose.

jjmarr 60 days ago

Humans haven't interacted with computers by typing in "5 columns right, 3 columns down" since before I was born. They use a mouse and keyboard.

Meanwhile AI agents are expected to guess pixels and fail each time.
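A rough sketch of the kind of helper script meant here: instead of guessing pixels, the agent names a grid cell and a small function turns it into a click position. The function name, the zero-indexed convention, and the cell sizes are all assumptions for illustration; nothing here is taken from the ARC-AGI-3 interface or any DS emulator.

```python
from typing import Tuple


def cell_center(col: int, row: int, cell_w: int, cell_h: int,
                origin_x: int = 0, origin_y: int = 0) -> Tuple[int, int]:
    """Return the pixel at the center of the zero-indexed grid cell (col, row).

    Hypothetical helper: origin_x/origin_y are the pixel coordinates of the
    grid's top-left corner, cell_w/cell_h the size of one cell in pixels.
    """
    x = origin_x + col * cell_w + cell_w // 2
    y = origin_y + row * cell_h + cell_h // 2
    return x, y


# "5 columns right, 3 rows down" on a grid of 16x16-pixel cells
print(cell_center(5, 3, 16, 16))
```

With something like this, the model only has to decide which cell to act on, and the pixel arithmetic is no longer part of what gets tested.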

falcor84 60 days ago

But humans aren't just a "reasoning component"; our nervous system (and body in general) provides us with significant capabilities that would be considered a "harness" for our frontal lobe. It just seems silly to me to try to solve all of this in a single leap. But I guess they just feel burned by how relatively quickly ARC-AGI-2 was solved.

sowbug 60 days ago

> Why don't they ask their premier model to generate a bench for them?

It's not a crazy idea. Have the older model interview the newer one and then ask both (or maybe a third referee model) which one they think is smarter. Repeat 100x with different seeds. The percentage of times both sides agree the newer model won is the score.
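A minimal sketch of that scoring loop, assuming each interview round can be wrapped in a function that takes a seed and returns two votes (whether the older model and the newer model each say the newer one won). `run_interview` is a hypothetical stand-in for the actual model calls, and the third-referee variant is left out.

```python
from typing import Callable, Tuple

# Hypothetical interface: seed -> (older_says_newer_won, newer_says_newer_won)
Interview = Callable[[int], Tuple[bool, bool]]


def agreement_score(run_interview: Interview, rounds: int = 100) -> float:
    """Fraction of rounds in which both models agree the newer model won."""
    agreed = 0
    for seed in range(rounds):
        older_vote, newer_vote = run_interview(seed)
        if older_vote and newer_vote:
            agreed += 1
    return agreed / rounds


# Stub in place of real model calls: the older model concedes on even seeds only.
def stub_interview(seed: int) -> Tuple[bool, bool]:
    return seed % 2 == 0, True


print(agreement_score(stub_interview, rounds=100))
```

Counting only rounds where both sides agree makes the score conservative: a newer model that always votes for itself gains nothing unless the older model concedes too.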

alansaber 60 days ago

Very (reasoning) heavy benchmarks do seem like the way to go, being the hardest to game.

xtracto 60 days ago

Can AI write a problem so difficult that even AI cannot solve it?

Hehe

ngruhn 60 days ago

How about prime factorization
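A toy sketch of why this fits: posing the problem is one multiplication and checking an answer is cheap, while finding the factors is the hard part. The function names are made up for illustration, the primes are tiny, and the trial-division primality check only works at this scale; a genuinely hard instance would need primes hundreds of digits long.

```python
from typing import Tuple


def is_prime(n: int) -> bool:
    """Trial division; fine for toy sizes only."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def make_challenge(p: int, q: int) -> int:
    """Posing the problem is a single multiplication."""
    return p * q


def check_answer(n: int, factors: Tuple[int, int]) -> bool:
    """Verifying is cheap: both factors must be prime and multiply to n."""
    a, b = factors
    return a * b == n and is_prime(a) and is_prime(b)


n = make_challenge(101, 103)
print(n, check_answer(n, (101, 103)), check_answer(n, (1, n)))
```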

andriy_koval 59 days ago

this was created by humans.