| HN Mirror


by getnormality 264 days ago

That's very interesting. Maybe you are doing this the right way, and my concern as a math educator is for the people who may struggle to stay on the straight and narrow, or know what the straight and narrow is in this brave new world.

Where I see deficiencies is not so much in the calculations. When a problem class has a solution algorithm and 10,000 worked examples online, I'm not too surprised that the LLM generalizes pretty reliably to that problem class.

The problem I find is more when it's tricky, out-of-distribution, not entirely on the "happy path" of what the 10,000 examples are about. In that case, LLM responses quickly become irrelevant, illogical, and Pavlovian. It's the math version of messing up the surgeon riddle when presented with a minor variation that is logically very easy, but isn't the popular version everyone talks about [1].

[1] https://www.thealgorithmicbridge.com/p/openai-researchers-ha...

2 comments

simonw 264 days ago

The International Mathematical Olympiad challenges should be pretty safely out of distribution. Gemini and OpenAI's best research models both scored gold on that this year.


getnormality 264 days ago

When they make a model with those abilities publicly available, I'll happily experiment with it, and I'd anticipate reporting that it is a lot better than what I experienced in the past.


simonw 264 days ago

The Gemini one is out now but expensive:

> Gemini Deep Think, our SOTA model with parallel thinking that won the IMO Gold Medal, is now available in the Gemini App for Ultra subscribers!!

https://twitter.com/OfficialLoganK/status/195126226151265943...


tptacek 264 days ago

No, we're not going to move the goalposts here. By positing a sufficiently misguided user of a piece of technology, you can tweak any argument so that the thread goes nowhere and nobody can update their mental models. I'm saying: LLMs are quite good at math tutoring, in many ways probably significantly better than human tutors (they're tireless, can explain any concept 50 different ways, and can rattle off individualized problem sets in seconds). I made that claim, and you pushed back saying that anything I saw "needed to be validated by an expert". You even said that I was an unreliable narrator because I'm studying math. No, to all of this.
