Hacker News
by nmca 443 days ago
The linked USAMO math results are from an exam that requires proofs. The same authors, on the same website, ran AIME 2025 shortly after it happened and found the results were totally consistent with the o1 announcement numbers; the difference is that the AIME requires only short answers and no proofs.

If you are a skilled mathematician, it is quite easy to verify both that (as of 7th April) models excel at novel calculations on held-out problems and that they mostly shit the bed when asked for proofs.

Gary cites these USAMO results as evidence of contamination influencing benchmark results, but that view is not consistent with the strong performance of the models on clearly held-out tasks (ARC test, AIME 25, HMMT 25, etc.).

If you really care, you can test this by inventing problems! It is a very very verifiable claim about the world.

In any case, this is not the pundit you want. There are many ways to make a bear case that are much saner than this.