by threepts
60 days ago
Why don't they ask their premier model to generate a benchmark for them? Jokes aside, a benchmark I'm looking forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning-heavy. Leaderboard: https://arcprize.org/leaderboard (most premier models don't even pass 5 percent).