Hacker News new | ask | show | jobs
by benchmarkist 584 days ago
Very cool. It'll be nice to have a benchmark that can be used to validate abstract reasoning capabilities because the hype is really starting to get out of hand.
2 comments

I wonder if the best benchmark is a Prolog program that generates tests of logical reasoning. You could have a functionally infinite stream of test cases!
Part of the magic of mathematical reasoning in humans is our ability to sidestep incompleteness theorems or undecidability headaches by simply changing the rules as befits the problem at hand: using a logical tool to solve a math problem seems largely formalizable and testable with Prolog/Lean/etc, but selecting or designing such a tool - e.g. choosing good definitions and axioms - is much more mysterious.

Put a bit more poetically: a Prolog benchmark can adequately test an LLM’s ability to create proofs in Euclidean geometry. But it will never test an LLM’s ability to reason whether a given axiomatization of geometry is actually a reasonable abstraction of physical space. And if our LLMs can do novel Euclidean proofs but are not able to meta-mathematically reason about novel axioms, then they aren’t really using intelligence. Formal logical puzzles are only a small subset of logical reasoning.

Likewise, when Euclidean proofs were a fun pastime among European upper-classes, the real work was being done by mathematicians who built new tools for projective and analytic geometry. In some sense our LLM benchmarks are focusing on the pastime and not the work. But in another sense LLMs are focusing on the tricksy and annoying sides of actually proving things, leaving humans free to think about deeper problems. So I’m not skeptical of LLMs’ utility in mathematical research, but rather the overinflated (and investor-focused) claims that this stuff is a viable path to AGI.

You could but most LLMs can't solve sudoku puzzles even though the training corpus already contains books on logic, constraint propagation, and state space exploration with backtracking.
the LLM just doesn't have enough compute I think probably step by step it could do it. anyways most leading LLMs can write a backtracking search program to do it and I think +tool use should be counted
I mean, this benchmark is really hard.

I don't think it's a requirement that a system claiming to be AGI should be able to solve these problems, 99.99% of humans can't either.

An AGI is often claimed to be a general purpose problem solver and these are exactly the types of problems that a general purpose problem solver would be able to solve if given access to a mathematical library. All existing LLMs have been trained on abstract mathematics and logic but it is obvious that they are incapable of abstract logical reasoning, e.g. solving sudoku puzzles.
Here is my prediction, FWIW: the hard part of the problem has already been solved, in the following technical sense: there is a few 1000 lines program that has not been invented yet, but it will be invented soon, that loads a current LLM model, runs fast on current hardware, and you will deem it to be an AGI. In other words, the conditional Kolmogorov complexity of undisputable AGI given the Llama weights is only a few 1000 bytes. We are at the pre-AlphaGo, post Clark-Storkey stage of reasoning. That's my guess, anyway.
I think you are likely right but coming up with that final inference strategy is still "the hard part" IMO. Not in terms of computation, but in terms of algorith development.
99.9% of the population would not be able to solve these problems given a year and access to every piece of mathematical literature ever written (except the solutions to these problems of course).

Saying that you need to solve these to be considered AGI is ridiculously strict.