Why can't a next-token predictor do math? Humans aren't calculators either, but we can still do math.
If you want proof, just look at the benchmarks. Modern frontier models get near-perfect accuracy on the American Invitational Mathematics Examination (AIME): https://matharena.ai/?comp=aime--aime_2026