Well, it's about the standard of proof. When I say "demonstrate", I don't mean just experimentally; I mean theoretically, showing that the algorithm is capable of reasoning about potentially arbitrarily large puzzle instances.
That's what the experiments have shown: once the unknown instance gets large enough, the LLM's reasoning breaks down. This is not the case with humans, who can, as noted elsewhere, do a tree search, form hypotheses, etc.