by harshreality
219 days ago
Even if a deep-thinking LLM like Opus can get some math questions right when that depends on identifying the type of problem and applying a learned procedure, it's not going to be able to evaluate the pedagogy of math books it has never encountered, or that were at most fringe material in its training set. I'm also referring to the faster models, not the slow and expensive deep-thinking ones, which I have little experience with. I don't see how reasoning would enable deep-thinking models to meaningfully evaluate textbook pedagogy, either.
|
They DO understand what they are doing. When I ask one to solve math problems, it goes through the several (many) steps involved (e.g. "apply the chain rule" while doing partial differentiation on a term in a Jacobian matrix). It gets pretty tedious when solving systems of linear equations, where it goes through each step of the Gaussian elimination while doing an LU decomposition, row by row. Step by step, in absolutely ridiculous detail. But one learns to ignore the blah-blah. The point: they absolutely 100% understand what they are doing, and understand it in minute detail.
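For anyone who hasn't watched one of these walkthroughs, the row-by-row elimination mentioned above looks roughly like this. This is only a minimal Python sketch, not output from any model: the function name `lu_decompose` and the sample matrix are made up for illustration, and it assumes no row swaps are needed (it simply fails on a zero pivot).

```python
from fractions import Fraction


def lu_decompose(a):
    """Doolittle LU factorization without pivoting (illustrative sketch).

    Returns (L, U, steps), where steps is a log of the row operations.
    """
    n = len(a)
    # Work in exact rational arithmetic so the logged multipliers are readable.
    u = [[Fraction(x) for x in row] for row in a]
    # L starts as the identity; its below-diagonal entries are the multipliers.
    l = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    steps = []
    for k in range(n):
        if u[k][k] == 0:
            raise ZeroDivisionError("zero pivot; this sketch does no row swaps")
        for i in range(k + 1, n):
            m = u[i][k] / u[k][k]  # multiplier that zeroes entry (i, k)
            l[i][k] = m
            for j in range(k, n):
                u[i][j] -= m * u[k][j]
            steps.append(f"R{i + 1} <- R{i + 1} - ({m}) * R{k + 1}")
    return l, u, steps


# Hypothetical 3x3 example matrix.
A = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
L, U, steps = lu_decompose(A)
for step in steps:
    print(step)
```

Each printed line is one row operation, which is the kind of step the models narrate one at a time.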
It's clearly NOT regurgitating something that it has literally seen before, because the level of detail is beyond ridiculous for a human. It is applying generalized rules to specific concrete problems, and doing so with some level of strategic thinking.
Where did it learn those generalized principles, and how did it learn to do that? With absolute certainty, there are math textbooks among the materials they have been trained on. And they certainly learned it from SOMEWHERE. Probably math textbooks. How did they learn to generalize and think strategically? Well, that's the big mystery, isn't it? But they do.
The very best models achieve high scores on Math Olympiad problem sets (so competitive with some of the best minds on the planet). And Terence Tao (greatest living mathematician) declares state-of-the-art models to be "better than most of my post-graduate students".
And what they can do is increasing by leaps and bounds on a weekly or monthly basis right now. It's hard to keep up. I frequently find that they can do things this week that they could not do a week or a month ago. Startling, and quite utterly amazing.
Most of the time, I am using Claude Sonnet 4.5 as my coding agent, for which I pay $10/month. Measured IQ of 110, I think, with an IQ of 120 if you flip it into thinking mode. But only because there isn't enough undergraduate-level mathematics in a standard IQ test. Claude Sonnet 4.5 is also available for free here: https://claude.ai/chats (during periods of heavy load, it may fall back to simpler models). I often use the free web interface instead of the coding agent interface for math problems, because it's easier to read mathematical equations in the browser version. And I generally use the free version of Claude instead of Google Search these days.