| HN Mirror



	by oofbaroomf 400 days ago
	Nice to see that Sonnet performs worse than o3 on AIME but better on SWE-Bench. Often, it's easy to optimize math capabilities with RL but much harder to crack software engineering. Good to see what Anthropic is focusing on.

1 comments

j_maffe 400 days ago

That's a very contentious opinion you're stating there. I'd say LLMs have surpassed a larger percentage of SWEs in capability than they have for mathematicians.

oofbaroomf 400 days ago

Mathematicians don't do high school math competitions - the benchmark in question is AIME.

Mathematicians generally do novel research, which is hard to optimize for. Benchmarks like LiveCodeBench (leetcode-style problems), AIME, and MATH (similar to AIME) are often chosen by companies so they can flex their models' capabilities, even if the models don't perform nearly as well at the things real mathematicians and real software engineers do.

j_maffe 400 days ago

OK, then you should clarify that you meant math benchmarks, not math capabilities.
