by mrandish
412 days ago
I'm not even following AI model performance testing that closely, but I'm hearing increasing reports that the results are inaccurate due to accidental or intentional leaking of test data into training data and other ways of training to the test. Also, ARC AGI reported they've been unable to independently replicate OpenAI's claimed breakthrough score from December. There's just too much money at stake now to not treat all AI model performance testing as an adversarial, no-holds-barred brawl. The default assumption should be that all entrants will cheat in any way possible. Commercial entrants with large teams of highly incentivized people will search and optimize for every possible advantage, if not outright cheat. As a result, smaller academic, student or community teams working part-time will tend to score lower than they would on a level playing field.
Can you elaborate on this? Where did ARC AGI report that? From ARC AGI[0]:
> ARC Prize Foundation was invited by OpenAI to join their “12 Days Of OpenAI.” Here, we shared the results of their first o3 model, o3-preview, on ARC-AGI. It set a new high-water mark for test-time compute, applying near-max resources to the ARC-AGI benchmark.
> We announced that o3-preview (low compute) scored 76% on ARC-AGI-1 Semi Private Eval set and was eligible for our public leaderboard. When we lifted the compute limits, o3-preview (high compute) scored 88%. This was a clear demonstration of what the model could do with unrestricted test-time resources. Both scores were verified to be state of the art.
That makes it sound like ARC AGI were the ones running the original test with o3.
What they say they haven't been able to reproduce is o3-preview's performance with the production versions of o3. They attribute this to the production versions being given less compute than the versions they ran in the test.
[0] https://arcprize.org/blog/analyzing-o3-with-arc-agi