I'm with you. I get that this is akin to asking a human: we're trying to get the model to reason, so it will presumably bring along the unavoidable deficiencies of human reasoning. But if I were to ask a human genius this question, ne would grab a calculator and use it as just another part of ner reasoning.
So it seems like we should probably teach LLMs to "use a calculator", rather than try to get them to be more right when doing math 'in their head'.
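To make "use a calculator" concrete, here is a minimal sketch of what I have in mind, not how any particular LLM product actually does it. The `[[calc: ...]]` marker and the names `calculate` and `fill_in_calculations` are made up for illustration: the idea is that the model writes out the expression instead of guessing the number, and plain code does the arithmetic.

```python
import ast
import operator
import re

# Only basic arithmetic is allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression.strip(), mode="eval"))


# Hypothetical convention: the model emits [[calc: <expression>]] wherever
# it would otherwise have to do the math "in its head".
_CALC_MARKER = re.compile(r"\[\[calc:(.*?)\]\]")


def fill_in_calculations(model_output: str) -> str:
    """Replace every [[calc: ...]] marker with the computed result."""
    return _CALC_MARKER.sub(lambda m: str(calculate(m.group(1))), model_output)
```

So a model output like `"The total is [[calc: 12345 * 6789]]."` would go through `fill_in_calculations` before anyone sees it, and the number in the final text comes from the evaluator rather than from the model's guess.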
Solving that will be a much bigger deal, but it's at odds with producing a highly accurate emulation of human thought and language. Language models can serve as tools to understand and experiment with logic formulated as natural language, but that isn't their primary purpose. What you're asking for is equivalent to creating an auditable trace of everything that goes into making a statement, which is pretty much impossible even for the person making the statement. We can get close by limiting ourselves to narrow domains like mathematics, but even then someone can come along and question the premises on which we construct such a system. I'm not saying it isn't worth pursuing; it just isn't the standard we should hold a model to when we ourselves are incapable of meeting it. The goal here is to create a system capable of doing the things that a human can do. If you would prefer a system that behaves within the confines of a mathematical formalism with well-defined rules, then build that model instead.