|
|
|
|
|
by famouswaffles
79 days ago
|
|
The harness was designed with the preview, but no it was still tested on the full public set in that environment. You can run the benchmark in different 'environments' though it's unclear what the difference between them is. >We then tested the harnesses on the full
public set (which researchers did not have access to at the time) |
|
> We then tested the harnesses on the full public set (which researchers did not have access to at the time). We found extreme bimodal performance across the two sets, controlling for the same frontier model...
The harness only transfers to like-environments and the intelligence for those specific games is baked into the harness by the humans who coded it for this specific challenge.
The point of ARC-AGI is to test the intelligence of AI systems in novel, but simple, environments. Having a human give it more powerful tools in a harness defeats the purpose. You should go back and read the original ARC-AGI paper to see what this is about+. Are you upset about the benchmark because frontier LLM models do so poorly exhibiting the ability to generalize when the benchmarks are released?
+ https://arxiv.org/abs/1911.01547