
by simianwords 217 days ago

“When researchers tested the same performance on a new set of benchmark questions, they noticed that models experienced ‘significant performance drops.’”

This is very misleading because the generalisation ability of LLMs is very high. They don’t just memorise problems; that’s nonsense.

At high school level maths you genuinely can’t get GPT-5 Thinking to make a single mistake. It’s not possible at all, unless you give it some convoluted, ambiguous prompt that no human could understand. If you assume I’m correct, how could GPT be memorising?

In fact, even undergraduate level mathematics is quite simple for GPT-5 Thinking.

IMO gold was won... by what? Memorising solutions?

I challenge people to find ONE example that GPT-5 Thinking gets wrong in high school or undergrad level maths. I could not find one. You must allow all tools though.

4 comments

YeGoblynQueenne 217 days ago

The best performance on GSM8K is currently 0.973, so less than perfect [1]. Given that GSM8K is a grade school math dataset and the leading LLMs still don't get all of its answers correct, it's safe to assume that they won't get all high school questions correct either, since those are going to be harder than grade school questions. This means there has got to be at least one example that GPT-5, as well as every other LLM, fails on [2].

If you don't think that's the case I think it's up to you to show that it's not.

___________________

[1] GSM8K leaderboard: https://llm-stats.com/benchmarks/gsm8k

[2] This is regardless of what GSM8K or any other benchmark is measuring.


simianwords 217 days ago

“In many reasoning-heavy benchmarks, o1 rivals the performance of human experts. Recent frontier models¹ do so well on MATH² and GSM8K that these benchmarks are no longer effective at differentiating models. We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America.”

https://openai.com/index/learning-to-reason-with-llms/

The benchmark was so saturated that they didn’t even bother running it on the newer models.

Which is interesting because it shows the rapid progress LLMs are making.

I’m also making a bigger claim: you can’t get GPT-5 Thinking to make a mistake in undergraduate level maths. At the very least, its performance would be comparable to that of a good student.


simianwords 217 days ago

Sure, I didn’t say it was perfect. But I’m questioning the essence of the article.


geoduck14 217 days ago

> At high school level maths you genuinely can’t get gpt-5 thinking to make a single mistake. Not possible at all.

If you give an LLM an incomplete question, it will guess at an answer. LLMs don't know what they don't know, and they are trained to guess.


simianwords 217 days ago

Example?


autop0ietic 217 days ago

I would think GPT-5 is great at high school level math, but what high school level math problems are not in the training data?

I think the problem is that GPT-5 is not "memorising", but that doesn't automatically mean it is "reasoning" either. These are human attributes that we are trying to map onto machines, and it just causes confusion.


simianwords 217 days ago

Make up one yourself and try it?


callmesnek 217 days ago

"You must allow all tools though"


y0eswddl 217 days ago

"if you don't let chatgpt search the internet or use python code, it doesn't count..."

look at those goalposts go!


simianwords 216 days ago

OK, don’t allow search or Python. Can you come up with an example? Probably not.
