by rosstaylor90 482 days ago
What's your AIME 2025 score? https://gr.inc/RJT1990/AIME2025/

That is the point of the AIME: it is a 3-hour closed-book examination in which each answer is an integer from 0 to 999 and should depend only on pre-calc...for a human with no calculator, notes, or internet access.

The concepts are heavily covered in the training corpus, and if people were allowed to take it more than once, with even a book, let alone access to the internet, it wouldn't be very hard.

Examples:

1) Find the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$

In the corpus: https://www.quora.com/In-what-bases-b-does-b-7-divide-into-9...
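
For reference, example 1 can be checked by brute force. This is only a sketch: the function name and the search limit are my own choices, not anything from the exam or the linked answer. It uses 17_b = b + 7 and 97_b = 9b + 7; since 9b + 7 = 9(b + 7) - 56, b + 7 has to divide 56, so a limit of 1000 is far more than needed.

```python
def bases_where_17_divides_97(limit=1000):
    """Return all bases b with 9 < b <= limit where 17_b divides 97_b."""
    hits = []
    for b in range(10, limit + 1):
        divisor = 1 * b + 7    # value of the numeral 17 in base b
        dividend = 9 * b + 7   # value of the numeral 97 in base b
        if dividend % divisor == 0:
            hits.append(b)
    return hits


bases = bases_where_17_divides_97()
print(bases, sum(bases))  # prints: [21, 49] 70
```

The two hits correspond to b + 7 = 28 and b + 7 = 56, the only divisors of 56 larger than 16.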

And one more:

3) https://artofproblemsolving.com/wiki/index.php/2025_AIME_I_P...

It is just the number of ways to distribute k indistinguishable balls (players) into n distinguishable boxes (flavors), without exclusion, in such a way that no box is empty.

Thus it is in the corpus for any course that needs to cover combinatorial problems, including physics, discrete math, logistics, etc...
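
The balls-into-boxes count described above has a standard closed form, C(k-1, n-1) (stars and bars). Here is a small sketch that checks it against direct enumeration. The values k = 9 and n = 3 and the function name are illustrative assumptions of mine, and this covers only the count as described in the comment, not any extra conditions the linked problem may impose.

```python
from itertools import product
from math import comb


def count_nonempty_distributions(k, n):
    """Brute force: count tuples (x_1, ..., x_n) with every x_i >= 1 and sum k."""
    return sum(1 for xs in product(range(1, k + 1), repeat=n) if sum(xs) == k)


k, n = 9, 3  # illustrative values, not taken from the thread
brute = count_nonempty_distributions(k, n)
formula = comb(k - 1, n - 1)  # stars-and-bars closed form
print(brute, formula)  # prints: 28 28
```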

IMHO these concept classes from a typical AIME are so common that the scores you gave demonstrate that those models are doing no "general reasoning" at all and are actually failing at approximate retrieval.

I disagree, 10 years ago AIs nailing these types of competition would have been seen as very impressive. The fact that the goalposts can move on this now shows how much AI has progressed.

(Also the term “approximate retrieval” is a bad one - reasoning is inherently a process of chaining together associations. What matters is whether the reasoning reaches the right conclusions. Still some way to go, but already very impressive in tasks traditionally considered harbours of human reasoning!)

> I disagree, 10 years ago AIs nailing these types of competition would have been seen as very impressive.

It would have been seen as witchcraft.

> What matters is whether the reasoning reaches the right conclusions

No, it doesn't. A broken clock is right twice a day; reasoning is about the journey more than the destination.

RL has more than two steps...
Point is that reasoning is about more than the conclusions. If your steps are wrong, your reasoning is wrong regardless of the conclusion. Poor reasoning is what could make an LLM conclude that 1 + 2 = 3 but that 2 + 1 = [some number other than 3].