by jmmcd 467 days ago
These puzzles probably have more in common with "Zebra puzzles" (e.g. https://www.zebrapuzzles.com/) than with Cluedo (Clue in the USA) itself. I've been doing some one-off experiments with Zebra puzzles recently. All the reasoning models generate an enormous batch of text, trying out possibilities, backtracking, and sometimes getting confused.

From what I can see (not rigorous): Claude 3.7 fails, ChatGPT with reasoning succeeds, DeepSeek with reasoning succeeds.

But of course the best way for a model to solve a problem like this is to translate it into a constraint satisfaction problem, and write out Python code to call a CSP solver.
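To make the idea concrete, here is a minimal sketch of a tiny Zebra-style puzzle written as a constraint satisfaction problem. The puzzle (three houses, three colours, three pets, four clues) is made up for illustration, and plain enumeration with `itertools.permutations` stands in for a real CSP solver so that the snippet needs nothing beyond the standard library.

```python
from itertools import permutations

# Hypothetical toy puzzle: three houses at positions 0, 1, 2 (left to right).
COLORS = ("red", "green", "blue")
PETS = ("cat", "dog", "fish")


def solve():
    """Return every assignment of positions that satisfies all four clues."""
    solutions = []
    for color_pos in permutations(range(3)):
        color = dict(zip(COLORS, color_pos))
        # Clue 1: the red house is immediately to the left of the green house.
        if color["red"] + 1 != color["green"]:
            continue
        for pet_pos in permutations(range(3)):
            pet = dict(zip(PETS, pet_pos))
            # Clue 2: the dog lives in the blue house.
            if pet["dog"] != color["blue"]:
                continue
            # Clue 3: the cat lives next door to the dog.
            if abs(pet["cat"] - pet["dog"]) != 1:
                continue
            # Clue 4: the fish does not live in the green house.
            if pet["fish"] == color["green"]:
                continue
            solutions.append({**color, **pet})
    return solutions


solutions = solve()
for s in solutions:
    print(s)
```

The point is that once the clues are stated as constraints, the search itself is mechanical. A dedicated CSP library would prune the search far better than this brute-force loop, but the translation step is the same.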


> But of course the best way for a model to solve a problem like this is to translate it

Which means that when you ask it (e.g.) whether A is better than B (as a Decision Support System), it should write a program to decide it instead of "guessing it" from the network.

You are stating that, since the issue is general, LLMs should write programs to produce their own outputs, instead of their standard output.

> since the issue is general

I'm not sure what that means specifically. I don't agree overall. Only certain types of problems encountered by LLMs map cleanly to well-understood problems where existing solvers are perfect.

I am stating that the ability to solve those puzzles is critical to an intelligence, and that the general questions I can think of require an intelligence as the processor. So if the LLMs "should write code" to solve those puzzles, then they should write code in general.

All problems require proficient reasoning to get a proper solution - not only puzzles. Without proper reasoning you can get some "heuristic", which is useful only if all you need is an unreliable result based on "grosso modo" (rough) criteria.

> Without proper reasoning you can get some "heuristic"

Right, but the question is whether this is good enough. And what counts as "proper". A lot of what we call proper reasoning is still quite informal, and even mathematics is usually not formal enough to be converted directly into a formal language like Coq.

So this is a deep question: is talking reasoning? Humans talk (out loud, or in their heads). Are they then reasoning? Sure, some of what happens internally is not just self-talk, but the thought experiment goes: if the problem is not completely ineffable, then (a bit like Borges' library) there is some 1000-word text which is the best possible reasoned, witty, English-language 1000-word solution to the problem. In principle, an LLM can generate that.

If your goal is a reductio, i.e. my statement must be false since it implies models should write code for every problem, then I disagree. The ability to solve these problems might be a requirement to be deemed "an intelligence", but many other problems which require an intelligence don't require the ability to solve these problems.

> Are they then reasoning

Reasoning properly is at least operating through processes that output correct results.

> Borges' library

Which in fact is exactly made of "non-texts" (the process that produces them is `String s = numToString(n++);` - they are encoded numbers, not "weaved ideas").
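A minimal sketch of that kind of enumeration, in Python rather than the Java-like one-liner above. The name `num_to_string`, the two-letter alphabet, and the bijective base-k encoding are all assumptions made for illustration; the original `numToString` is not defined anywhere in this thread.

```python
# Assumption: a tiny two-letter alphabet, to keep the output readable.
ALPHABET = "ab"


def num_to_string(n, alphabet=ALPHABET):
    """Map a non-negative integer to a string (bijective base-k numbering).

    Every finite string over the alphabet appears for exactly one n,
    so counting n upwards enumerates all possible "texts".
    """
    k = len(alphabet)
    chars = []
    while n > 0:
        n, r = divmod(n - 1, k)
        chars.append(alphabet[r])
    return "".join(reversed(chars))


first_texts = [num_to_string(n) for n in range(7)]
print(first_texts)  # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Nothing in the loop knows or cares what any string means, which is the sense in which the library's contents are encoded numbers.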

> many other problems which require an intelligence don't require the ability to solve these problems

Which ones? Which problems that demand producing correct solutions could be solved by a general processor which could not solve a "detective game"?

I can't reply to your new post below; I guess the thread is too deep. But you've bitten the bullet and stated that what humans do is not reasoning, I think.

You didn't like "what colour is the sky" (without looking), ok. "Given the following [unseen during training] page of text, can you guess what emotion the main character is feeling at the end?" This is a problem that a human can solve, and many LLMs can solve, even if they can't solve the detective puzzle. In case it doesn't sound important, this can be reframed as a customer-service sentiment-recognition problem.

> Reasoning properly is at least operating through processes that output correct results.

Human "reasoning" (i.e. speech or self-talk that sounds a bit like reasoning) often outputs correct results. Does "often" fit the definition?

> Which problems that demand producing correct solutions could be solved by a processor which could not solve a "detective game"?

For example, "what colour is the sky right now?". A lot of people could solve this (even if they haven't looked outside), and so could a lot of language models, which can't solve this detective game.