LLMs are deterministic, but they only return a probability distribution over the next token. The tokens the user sees in the response are then selected by a sampling procedure, which is typically stochastic.
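A minimal sketch of that split, assuming a toy four-token vocabulary and made-up logits (the `vocab` and `logits` values are hypothetical, not from any real model). The forward pass that produces the logits is deterministic. Randomness only enters when a token is drawn from the resulting distribution. Greedy decoding removes it entirely.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits after a prompt like "1+1=".
# Same input -> same logits: this part is deterministic.
vocab = ["2", "3", "11", "two"]
logits = [9.0, 1.0, 2.5, 4.0]
probs = softmax(logits)

# Greedy decoding: always take the most likely token (deterministic).
greedy = vocab[max(range(len(vocab)), key=lambda i: probs[i])]

# Ancestral sampling: draw tokens in proportion to probs (stochastic).
rng = random.Random(0)
sampled = rng.choices(vocab, weights=probs, k=10)

print(greedy)
print(sampled)
```

Changing the seed can change `sampled`, but never `probs` or `greedy`.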
Assuming decent training data, sampling won't be meaningfully stochastic for many math operation/input combinations, because nearly all of the probability mass lands on a single token. When people suggest that LLMs with tokenization could learn math, they aren't suggesting a small, undertrained model trained on crappy data.
I mean, this depends on your sampler. With temp=1 and sampling from the raw output distribution, setting aside numerical issues, these models assign nonzero probability to every token at each position, since a softmax over finite logits is strictly positive.
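A small sketch of why the sampler matters, reusing made-up logits (hypothetical values). At temp=1 over the raw distribution every token keeps some probability. A truncating sampler such as top-k (here a hand-rolled `top_k_filter`, not any particular library's API) zeroes the tail, and k=1 is just greedy decoding. The last line shows the numerics caveat: in float64, `exp` underflows to exactly 0.0 for arguments below roughly -745.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

logits = [9.0, 1.0, 2.5, 4.0]   # hypothetical logits

raw = softmax(logits)            # temp=1, raw distribution
print(all(p > 0 for p in raw))   # True: every token can be sampled

truncated = top_k_filter(raw, k=1)
print(truncated)                 # [1.0, 0.0, 0.0, 0.0]: equivalent to greedy

# Numerics caveat: an extreme enough logit gap underflows to exactly zero.
print(softmax([0.0, -800.0]))    # [1.0, 0.0]
```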
A large model trained well on good data will give a continuation like "3" after "1+1=" a logit so negative that it won't come up in practice unless you deliberately sample in a way that misuses the model.
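To put rough numbers on "won't come up in practice": if the wrong token's logit sits `gap` below the top token's, then at temperature T the ratio of their probabilities is exp(-gap/T), so the wrong token's probability is at most exp(-gap/T). The gaps below are hypothetical, chosen only for illustration. A gap of 30 at temp=1 bounds the wrong answer at about 1e-13 per sample.

```python
import math

def wrong_token_prob_bound(logit_gap, temperature=1.0):
    """Upper bound on a token's probability when its logit is `logit_gap`
    below the top token's: p_wrong / p_top = exp(-gap / T), and p_top <= 1."""
    return math.exp(-logit_gap / temperature)

# Hypothetical gaps between the "2" and "3" logits after "1+1=".
n_samples = 1_000_000_000
for gap in (5, 15, 30):
    p = wrong_token_prob_bound(gap)
    # Expected number of wrong answers across n_samples temp=1 draws.
    expected = p * n_samples
    print(f"gap={gap:>2}  p<={p:.2e}  expected wrong in 1e9 samples<={expected:.2e}")
```

Raising the temperature divides the gap by T, which is one way of "sampling to deliberately misuse the model".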