In my experience, it's 100%. Not 95%, not 99%. Unless GPT5 (and O4-mini) were colluding with Math Academy behind the scenes specifically to be wrong about something, it just doesn't get any of this content wrong.
And keep in mind, what it's getting right is trickier than just answering Calc I questions: it's taking an answer I give it, calculating the correct answer itself, selecting its answer over mine, and then spotting where I e.g. forgot to check the domain of a variable inside a log.
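To make the "domain of a variable inside a log" slip concrete, here is a minimal sketch. The equation ln(x) + ln(x - 2) = ln(3) is my own illustrative pick, not a problem from Math Academy. Combining the logs gives x(x - 2) = 3, which has two algebraic roots, but only one of them keeps both log arguments positive.

```python
import math

# Illustrative equation (an assumption, not from the thread):
#   ln(x) + ln(x - 2) = ln(3)
# Combining logs gives x * (x - 2) = 3, i.e. x^2 - 2x - 3 = 0.
a, b, c = 1.0, -2.0, -3.0

# Quadratic formula: both algebraic roots, before any domain check.
disc = b * b - 4 * a * c
candidates = [(-b + sign * math.sqrt(disc)) / (2 * a) for sign in (1, -1)]


def in_domain(x: float) -> bool:
    """Both log arguments, x and x - 2, must be strictly positive."""
    return x > 0 and x - 2 > 0


# The step that is easy to forget: discard roots outside the domain.
valid = [x for x in candidates if in_domain(x)]

print("algebraic roots:", candidates)
print("roots in the domain:", valid)
```

The root x = -1 satisfies the quadratic but not the original equation, since ln(-1) is undefined over the reals.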
Yeah, they seem to be there on high school math problems today. There aren't that many variations on them, and there are billions of examples of them in the data, so LLMs can saturate those.
Just don't assume they are this reliable at solving real-world math tasks yet. Those are more varied and still stump models.
I've used LLMs to try to help digest some advanced maths, e.g. "Explain the number field sieve with lots of numeric examples".
Yes, the numeric examples often don't work. The consequences of this, though, are similar to a failed web search: it's not a big deal, and when it does work it's very helpful.
Maths is one of those things with so much objectivity that even the LLM usually realizes it has failed to create a numeric example. "Here the numeric example breaks down since we cannot find a congruence of squares in this example without finding more B-smooth numbers in step 1". OK, that's a shame, I would have loved to see an end-to-end numeric example.
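For reference, the congruence-of-squares step can be shown end to end on a toy number. This is only a sketch of that one idea, not the number field sieve itself, and the values N = 1649 with x = 41 and x = 43 are hand-picked by me so that the smooth relations happen to combine into a square.

```python
from math import gcd, isqrt

# Toy inputs (assumptions chosen for illustration, not from the thread).
N = 1649
x1, x2 = 41, 43

# Step 1: squares mod N that are smooth over the factor base {2, 5}.
r1 = (x1 * x1) % N  # 32  = 2^5
r2 = (x2 * x2) % N  # 200 = 2^3 * 5^2

# Step 2: combine relations so every prime exponent is even.
product = r1 * r2  # 2^8 * 5^2 = 6400
y = isqrt(product)
assert y * y == product  # the combined residue is a perfect square

# Step 3: we now have x^2 = y^2 (mod N) with x = x1 * x2 mod N.
x = (x1 * x2) % N
assert (x * x - y * y) % N == 0

# Step 4: gcd(x - y, N) is a nontrivial factor when x != +/- y (mod N).
factor = gcd(x - y, N)
cofactor = N // factor

print(f"{x}^2 = {y}^2 (mod {N})")
print(f"{N} = {factor} * {cofactor}")
```

With only two relations this works out by luck of the chosen values; a real sieve collects many more smooth relations and uses linear algebra over GF(2) to find a subset whose product is a square.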
I think people get too hung up on any possibility of LLMs not being perfect while still being extremely helpful.
It's a term I used to explain that in 'thinking' mode LLMs will read their own output and call out things like incorrect math statements before posting to the user.
Now you probably want a debate about the term 'thinking' mode, but I cbf with that. It's pretty clear what was meant, and semantic arguments suck. Don't do that.