by diyer22 127 days ago
MathArena uses newly released competition sets and evaluates models close to the event. They also mark models released after the competition date as potential contamination.
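
As a rough sketch of that flagging rule (my reading of it, not MathArena's actual code; the function name and the dates below are made up for illustration):

```python
from datetime import date


def is_potentially_contaminated(model_release: date, competition_date: date) -> bool:
    """Hypothetical helper: flag a model released after the competition date.

    A model released after the problems became public could have seen them
    during training, so its score gets marked as potentially contaminated.
    """
    return model_release > competition_date


# Illustrative dates only, not taken from any real leaderboard.
competition = date(2030, 3, 10)
print(is_potentially_contaminated(date(2030, 3, 1), competition))   # released before the event
print(is_potentially_contaminated(date(2030, 4, 1), competition))   # released after the event
```

The comparison is strict, so in this sketch a model released on the competition day itself is not flagged; whether that is the right call is a judgment the leaderboard would have to make.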

On Feb 6, on the just-concluded AIME 2026 I, Step 3.5 Flash took first place. Step 3.5 Flash was released on Feb 1, before the competition, so it could not have seen the problems in training.