Hacker News new | ask | show | jobs
by lsedgwick 977 days ago
You're right about the basic mistakes they can make - they can also excel at the same tasks if prompted differently. But I was making a slightly different point. The point is they can reason about things in a better-than-chance way (I mean, a much-better-than-chance way) when given problems not in the training set. Have you read the Codex paper? Seriously, go look, an LLM even from years ago (which is like decades ago in ML-time) is often able to write code to solve novel programming problems that were handwritten to not be in the training set! Also, the benchmarks against which GPT3, InstructGPT, and GPT4 were tested are illuminating about its problem solving ability on novel problems, these are described in the three respective papers.

Secondly, quality is highly variant and there are traps the context window can fall into which causes especially bad results. Jeremy Howard has a great video (https://www.youtube.com/watch?v=jkrNMKz9pWU, starting at 18:05 the 'limitations and capabilities' section is only 13 minutes long) talking about how quality depends on: how you frame your prompts, model power (4 does a lot of stuff that 3.5 can't), and whether you're in a kind of "context trap" of repeated mistakes.

Of course, some people like to point out that if it's so "finicky" and variant, it is "dumb." Sure, if you like. I'm not interested in whatever definitions you're using those things, the objective and observable point is that given well-known prompting practices, LLMs can do something functionally equivalent to reasoning about novel problems, and more powerful ones can reason about more powerful and difficult things.

I re-phrased your prompt (instead of "prove a false thing" I made it like "decide whether this thing can exist, and prove your answer"). And added a little well-known boilerplate prompt sugar. It seems to have done a better job.

https://chat.openai.com/share/53214f0c-17f7-4a3d-95be-8fd676...

1 comments

If your statement is that they can perform significantly better than chance, given certain favourable conditions (e.g. good prompt engineering), then I'll agree, although I'm not sure how helpful it is given that you can never know whether what they reply is garbage or not.

But even given "careful prompt engineering" it's still extremely easy to make ChatGPT trip up (I don't have access to GPT-4, so let's talk about ChatGPT for now):

> User: User I am a student of machine learning and I have a graduate-level of understanding of certain subfields of machine learning and mathematics. I would like to ask you a tricky question about math. Please think carefully and step-by-step. Because, as a language model, each token you generate is a way to perform more computation, please write out your understanding of the problem and your steps of thinking about it. The challenge is as follows: decide whether or not there exists an algorithm that can decide for any statement in propositional logic whether it's true or not

> ChatGPT: The problem you're describing is known as the decision problem for propositional logic, or the "satisfiability problem" (SAT). It asks whether there exists an algorithm that can determine, for any given statement in propositional logic, whether it's true or not. This problem is fundamental in computer science and mathematical logic.

> I will walk you through the key concepts and steps in thinking about this problem:

> Propositional Logic: In propositional logic, statements are composed of variables and logical connectives (AND, OR, NOT, etc.), and the goal is to determine whether a given combination of truth values for these variables makes the statement true (satisfiable) or false (unsatisfiable).

> Formal Statement of SAT: The SAT problem can be formally stated as follows: Given a propositional formula (a statement in propositional logic), can we design an algorithm that correctly decides whether there exists an assignment of truth values to the variables that makes the formula true?

> Assume There Exists Such an Algorithm: Let's assume, for the sake of argument, that there exists an algorithm SAT that can decide the satisfiability problem for any propositional formula.

> Constructing a Contradiction: Now, we'll use a technique similar to the one used in the proof of the Halting Problem to create a contradiction. We define a new formula, Q, that represents a situation where SAT contradicts itself. We construct Q as follows:

> Q(P):

> if SAT(P) == True:

> NOT(P) # Negation of the input formula P

> else:

> P

> In other words, if SAT says that P is satisfiable, Q takes the negation of P, making it unsatisfiable, and if SAT says that P is unsatisfiable, Q takes P itself.

> Consider Q(Q): Now, let's consider what happens when we apply Q to itself, i.e., Q(Q).

> If SAT(Q) == True, then Q(Q) should be unsatisfiable because it takes the negation of itself.

> If SAT(Q) == False, then Q(Q) should be satisfiable because it takes itself.

> Contradiction: We've reached a contradiction. If SAT(Q) == True, then Q(Q) should be unsatisfiable, but our definition of Q says it should be satisfiable. If SAT(Q) == False, then Q(Q) should be satisfiable, but our definition of Q says it should be unsatisfiable. This contradiction arises because SAT cannot consistently determine the satisfiability of Q(Q).

> Conclusion: As in the Halting Problem, we've demonstrated that there are cases where the algorithm SAT fails to give a correct answer. Therefore, we can conclude that SAT (or any similar algorithm) cannot exist to solve the satisfiability problem for all cases.

> This proof shows that there is no general algorithm that can decide the satisfiability of any propositional formula. The SAT problem is known to be NP-complete, which means it's one of the most challenging problems in computational complexity theory. While efficient algorithms exist for many instances, there's no guarantee of a general solution for all possible cases. This has significant implications for problems in complexity theory and computer science, including issues related to algorithmic efficiency and the limits of computation.

This is absolute garbage. Even worse than garbage, it adds the additional insult of being plausibly-sounding garbage to people who've had maybe a cursory introduction to theoretical CS. And notice that it's not even that hard to come up with such an example: a question like this could plausibly be on a problem set for an intro theoretical CS course (if it's not too easy for even that).

My main point is that this betrays an absolute incapability of ChatGPT to perform basic deductive reasoning reliably. It clearly knows that the referenced statement is about SAT, and, if you prompt it directly, it also knows that SAT is decidable, but it somehow isn't able to connect these two facts.

BTW my objection was related to mathematics, not coding. It's possible that, in the average case, LLMs perform much better at coding, since the level of rigour required in many cases is less than in mathematics. But when it comes to pure deductive reasoning, I wouldn't trust LLMs. Or at least not ChatGPT.