| HN Mirror

If your statement is that they can perform significantly better than chance, given certain favourable conditions (e.g. good prompt engineering), then I'll agree, although I'm not sure how helpful it is given that you can never know whether what they reply is garbage or not.

But even given "careful prompt engineering" it's still extremely easy to make ChatGPT trip up (I don't have access to GPT-4, so let's talk about ChatGPT for now):

> User: User I am a student of machine learning and I have a graduate-level of understanding of certain subfields of machine learning and mathematics. I would like to ask you a tricky question about math. Please think carefully and step-by-step. Because, as a language model, each token you generate is a way to perform more computation, please write out your understanding of the problem and your steps of thinking about it. The challenge is as follows: decide whether or not there exists an algorithm that can decide for any statement in propositional logic whether it's true or not

> ChatGPT: The problem you're describing is known as the decision problem for propositional logic, or the "satisfiability problem" (SAT). It asks whether there exists an algorithm that can determine, for any given statement in propositional logic, whether it's true or not. This problem is fundamental in computer science and mathematical logic.

> I will walk you through the key concepts and steps in thinking about this problem:

> Propositional Logic: In propositional logic, statements are composed of variables and logical connectives (AND, OR, NOT, etc.), and the goal is to determine whether a given combination of truth values for these variables makes the statement true (satisfiable) or false (unsatisfiable).

> Formal Statement of SAT: The SAT problem can be formally stated as follows: Given a propositional formula (a statement in propositional logic), can we design an algorithm that correctly decides whether there exists an assignment of truth values to the variables that makes the formula true?

> Assume There Exists Such an Algorithm: Let's assume, for the sake of argument, that there exists an algorithm SAT that can decide the satisfiability problem for any propositional formula.

> Constructing a Contradiction: Now, we'll use a technique similar to the one used in the proof of the Halting Problem to create a contradiction. We define a new formula, Q, that represents a situation where SAT contradicts itself. We construct Q as follows:

> Q(P):

> if SAT(P) == True:

> NOT(P) # Negation of the input formula P

> else:

> P

> In other words, if SAT says that P is satisfiable, Q takes the negation of P, making it unsatisfiable, and if SAT says that P is unsatisfiable, Q takes P itself.

> Consider Q(Q): Now, let's consider what happens when we apply Q to itself, i.e., Q(Q).

> If SAT(Q) == True, then Q(Q) should be unsatisfiable because it takes the negation of itself.

> If SAT(Q) == False, then Q(Q) should be satisfiable because it takes itself.

> Contradiction: We've reached a contradiction. If SAT(Q) == True, then Q(Q) should be unsatisfiable, but our definition of Q says it should be satisfiable. If SAT(Q) == False, then Q(Q) should be satisfiable, but our definition of Q says it should be unsatisfiable. This contradiction arises because SAT cannot consistently determine the satisfiability of Q(Q).

> Conclusion: As in the Halting Problem, we've demonstrated that there are cases where the algorithm SAT fails to give a correct answer. Therefore, we can conclude that SAT (or any similar algorithm) cannot exist to solve the satisfiability problem for all cases.

> This proof shows that there is no general algorithm that can decide the satisfiability of any propositional formula. The SAT problem is known to be NP-complete, which means it's one of the most challenging problems in computational complexity theory. While efficient algorithms exist for many instances, there's no guarantee of a general solution for all possible cases. This has significant implications for problems in complexity theory and computer science, including issues related to algorithmic efficiency and the limits of computation.

This is absolute garbage. Even worse than garbage, it adds the additional insult of being plausibly-sounding garbage to people who've had maybe a cursory introduction to theoretical CS. And notice that it's not even that hard to come up with such an example: a question like this could plausibly be on a problem set for an intro theoretical CS course (if it's not too easy for even that).

My main point is that this betrays an absolute incapability of ChatGPT to perform basic deductive reasoning reliably. It clearly knows that the referenced statement is about SAT, and, if you prompt it directly, it also knows that SAT is decidable, but it somehow isn't able to connect these two facts.

BTW my objection was related to mathematics, not coding. It's possible that, in the average case, LLMs perform much better at coding, since the level of rigour required in many cases is less than in mathematics. But when it comes to pure deductive reasoning, I wouldn't trust LLMs. Or at least not ChatGPT.