HN Mirror



	by ripbozo 125 days ago
	Does the ARC-AGI-2 score more than doubling in a .1 release indicate benchmark-maxing? Though I don't know what ARC-AGI-2 actually tests.

4 comments

maxall4 125 days ago

Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvement on other benchmarks is not of the same order.


moffkalast 125 days ago

https://arcprize.org/arc-agi/1/

It's a sort of arbitrary pattern matching thing that can't be trained on in the sense that the MMLU can be, but you can definitely generate billions of examples of this kind of task and train on them, and it will not make the model better on any other task. So in that sense, it absolutely can be benchmaxxed.
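
To make the "generate billions of examples" point concrete, here is a minimal sketch of a synthetic task generator. It is only an illustration: the grid transforms, the function names (`make_task`, `random_grid`), and the task layout (a dict with `train` and `test` lists of input/output grid pairs, cells as integers 0-9) are assumptions loosely modeled on the public ARC task files, not anything a lab is known to use.

```python
import random

Grid = list[list[int]]


def random_grid(rng: random.Random, height: int, width: int, colors: int = 10) -> Grid:
    """Build a height x width grid of random color indices in [0, colors)."""
    return [[rng.randrange(colors) for _ in range(width)] for _ in range(height)]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror each row left-to-right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [row[:] for row in grid[::-1]]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def rotate_clockwise(grid: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


# Hypothetical rule set; a real generator would compose far richer rules.
TRANSFORMS = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "rotate_clockwise": rotate_clockwise,
}


def make_task(rng: random.Random, n_train: int = 3) -> dict:
    """Pick one hidden rule and emit demonstration pairs plus one test pair."""
    name = rng.choice(sorted(TRANSFORMS))
    rule = TRANSFORMS[name]

    def pair() -> dict:
        grid = random_grid(rng, rng.randint(3, 6), rng.randint(3, 6))
        return {"input": grid, "output": rule(grid)}

    return {
        "rule": name,
        "train": [pair() for _ in range(n_train)],
        "test": [pair()],
    }


if __name__ == "__main__":
    rng = random.Random(0)
    tasks = [make_task(rng) for _ in range(1000)]
    print(len(tasks), "synthetic tasks generated")
```

Scaling the loop up is trivial, which is the point: volume is cheap, and nothing about it guarantees the model learns anything beyond this family of puzzles.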

I think it's been harder to solve because it's a visual puzzle, and we know how well today's vision encoders actually work https://arxiv.org/html/2407.06581v1


km144 124 days ago

The real question is: Why are people designing benchmarks where training a model on them won't improve its performance on any real-world task? Why would anyone care about such benchmarks?


moffkalast 124 days ago

People are like typewriter monkeys: if something is possible to make, it'll eventually be made.


boplicity 125 days ago

Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.


energy123 125 days ago

Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.


tasuki 125 days ago

Didn't the same Francois Chollet claim that this was the Real Test of Intelligence? If they target it, perhaps they target... real intelligence?


ainch 125 days ago

He's always said ARC is a necessary but not sufficient condition for testing intelligence afaik


energy123 125 days ago

He said in an interview that it doesn't count if it's explicitly targeted, only if a model generalizes to it.

He also said that the "real test of intelligence" is when we can no longer come up with new tests that a human can easily do but the AI can't, not whether a model passes any specific benchmark.


CamperBob2 125 days ago

I don't know what he could mean by that, as the whole idea behind ARC-AGI is to "target the benchmark." Got any links that explain further?


layer8 125 days ago

The fact that ARC-AGI has public and semi-private in addition to private datasets might explain it: https://arcprize.org/arc-agi/2/#dataset-structure


segmondy 124 days ago

He should have kept it closed.


blinding-streak 125 days ago

I assume all the frontier models are benchmaxxing, so it would make sense
