| HN Mirror

Part of the magic of mathematical reasoning in humans is our ability to sidestep incompleteness theorems or undecidability headaches by simply changing the rules as befits the problem at hand: using a logical tool to solve a math problem seems largely formalizable and testable with Prolog/Lean/etc, but selecting or designing such a tool - e.g. choosing good definitions and axioms - is much more mysterious.

Put a bit more poetically: a Prolog benchmark can adequately test an LLM’s ability to create proofs in Euclidean geometry. But it will never test an LLM’s ability to reason whether a given axiomatization of geometry is actually a reasonable abstraction of physical space. And if our LLMs can do novel Euclidean proofs but are not able to meta-mathematically reason about novel axioms, then they aren’t really using intelligence. Formal logical puzzles are only a small subset of logical reasoning.

Likewise, when Euclidean proofs were a fun pastime among European upper-classes, the real work was being done by mathematicians who built new tools for projective and analytic geometry. In some sense our LLM benchmarks are focusing on the pastime and not the work. But in another sense LLMs are focusing on the tricksy and annoying sides of actually proving things, leaving humans free to think about deeper problems. So I’m not skeptical of LLMs’ utility in mathematical research, but rather the overinflated (and investor-focused) claims that this stuff is a viable path to AGI.