by energy123
375 days ago
That would require AIME 2024 going above 100%. There were always going to be diminishing returns on these benchmarks. It's by construction: a score capped at 100% can't keep rising at a fixed rate, so it's mathematically impossible for that not to happen. But it doesn't mean the models are getting better at a slower pace. Benchmark scores are just a proxy for what we care about; don't confuse them with the actual destination. If you want, you can choose to look at a different set of benchmarks like ARC-AGI-2 or Epoch, observe greater-than-linear improvements, and forget that these easier benchmarks exist.
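To make the "by construction" point concrete, here is a rough sketch. It assumes, purely for illustration, that model ability is an unbounded number that rises by the same amount each generation, and that benchmark accuracy is a logistic function of it. The `accuracy_from_skill` function and the skill values are hypothetical, not measurements of any real model.

```python
import math


def accuracy_from_skill(skill: float) -> float:
    """Map an unbounded, hypothetical 'skill' value to an accuracy in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-skill))


# Assumption: skill improves by a constant step every generation.
skills = [0.0, 1.0, 2.0, 3.0, 4.0]
accuracies = [accuracy_from_skill(s) for s in skills]

# Percentage-point gain on the benchmark between consecutive generations.
gains = [later - earlier for earlier, later in zip(accuracies, accuracies[1:])]

for skill, accuracy in zip(skills, accuracies):
    print(f"skill={skill:.1f}  accuracy={accuracy:.1%}")
print("gain per step:", ", ".join(f"{g:.1%}" for g in gains))
```

Under these assumptions the underlying skill grows linearly while the benchmark gains shrink every step, which is the saturation effect on a capped score rather than evidence of slower progress.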
"ARC-AGI-1:
* Low: 44%, $1.64/task
* Medium: 57%, $3.18/task
* High: 59%, $4.16/task

ARC-AGI-2:
* All reasoning efforts: <5%, $4-7/task

Takeaways:
* o3-pro in line with o3 performance
* o3's new price sets the ARC-AGI-1 Frontier"
- https://x.com/arcprize/status/1932535378080395332