by lsedgwick 973 days ago
There are certainly emergent capabilities that come out of training a model to predict next tokens, and it has been shown that LLMs can often solve logic and math problems, riddles, etc. which are not in their training data, because deep in the internal layers of the model there is something resembling logic and reasoning abilities. This can be seen in the "benchmark" results of papers like the GPT-3 and InstructGPT papers, among many others, where the authors tried pretty hard to test the models against questions which were not part of the training data.

This is also shown in the Codex paper, where they trained an LLM to write code and then watched it solve a number of code problems they handwrote originally to make sure the problems could not have been in the training data.

Try it out yourself, make up some little math word problems and ask chatGPT or something.

Of course, Advent of Code problems will be much more challenging, but to get help with some subcomponents of a problem, a motivated participant would likely use the most recent and powerful models, which outperform both the results from papers written a few years ago and the free ChatGPT.


> Try it out yourself, make up some little math word problems and ask chatGPT or something.

I have, and chatGPT simply sucks at logical reasoning in the way that is required for mathematics. It's extremely easy to make it trip up by just formulating questions in a slightly nonstandard way (so that it doesn't exactly match what it has been trained on), e.g.

> User: prove that there are only finitely many primes

> ChatGPT: Actually, it is not possible to prove that there are only finitely many prime numbers because this statement is false. There are infinitely many prime numbers, and this has been proven by the ancient Greek mathematician Euclid over 2,000 years ago. Euclid's proof is based on the principle that if you assume there are only finitely many prime numbers, you can always find a new prime by multiplying all known primes together and adding 1. This new number will either be prime itself or have a prime factor not present in the original list of primes, which contradicts the assumption of finiteness.

> User: prove that there exists a finite list of integers that contains all integers which are only divisible by itself and 1

> ChatGPT: You're referring to prime numbers. The statement is essentially the definition of prime numbers. A prime number is a positive integer greater than 1 that is only divisible by 1 and itself. There is no need to prove this statement because it defines what prime numbers are. While there are infinitely many prime numbers, they can be listed, but this list will be finite in the sense that there are infinitely many prime numbers, but they can all be enumerated within an infinite sequence. Some of the first prime numbers are 2, 3, 5, 7, 11, 13, and so on.

ChatGPT gets the first question right because something like that is undoubtedly in its training corpus, but it gets utterly confused by the second question (which is the same statement), to the point that its answer belongs to the "not even wrong" category.
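For reference, the Euclid argument quoted in the first answer can be checked mechanically. This is only a minimal sketch: the helper names `smallest_prime_factor` and `euclid_witness` are made up for illustration, and trial division is only practical for small inputs. Given any finite list of primes, it multiplies them, adds 1, and returns a prime factor of the result, which cannot be in the list.

```python
from math import prod
from typing import Sequence


def smallest_prime_factor(n: int) -> int:
    """Return the smallest prime factor of n (n >= 2) by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # no divisor up to sqrt(n), so n itself is prime


def euclid_witness(primes: Sequence[int]) -> int:
    """Given a finite list of primes, return a prime that is not in it.

    Every prime in the list divides prod(primes), so it leaves remainder 1
    when dividing prod(primes) + 1. Any prime factor of that number is
    therefore missing from the list.
    """
    return smallest_prime_factor(prod(primes) + 1)


primes = [2, 3, 5, 7, 11, 13]
witness = euclid_witness(primes)
print(witness, witness in primes)
```

So no finite list can contain all primes, which is why both prompts above ask for a proof of a false statement.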

I don't know where this myth comes from that LLMs are magically good at maths. They're not.

You're right about the basic mistakes they can make - they can also excel at the same tasks if prompted differently. But I was making a slightly different point. The point is they can reason about things in a better-than-chance way (I mean, a much-better-than-chance way) when given problems not in the training set. Have you read the Codex paper? Seriously, go look: an LLM even from years ago (which is like decades ago in ML-time) is often able to write code to solve novel programming problems that were handwritten to not be in the training set! Also, the benchmarks against which GPT-3, InstructGPT, and GPT-4 were tested are illuminating about their problem-solving ability on novel problems; these are described in the three respective papers.

Secondly, quality is highly variable, and there are traps the context window can fall into which cause especially bad results. Jeremy Howard has a great video (https://www.youtube.com/watch?v=jkrNMKz9pWU, starting at 18:05; the 'limitations and capabilities' section is only 13 minutes long) talking about how quality depends on how you frame your prompts, model power (4 does a lot of stuff that 3.5 can't), and whether you're in a kind of "context trap" of repeated mistakes.

Of course, some people like to point out that if it's so "finicky" and variable, it is "dumb." Sure, if you like. I'm not interested in whatever definitions you're using for those things; the objective and observable point is that given well-known prompting practices, LLMs can do something functionally equivalent to reasoning about novel problems, and more powerful ones can reason about more powerful and difficult things.

I re-phrased your prompt (instead of "prove a false thing" I made it like "decide whether this thing can exist, and prove your answer"). And added a little well-known boilerplate prompt sugar. It seems to have done a better job.

https://chat.openai.com/share/53214f0c-17f7-4a3d-95be-8fd676...

If your statement is that they can perform significantly better than chance, given certain favourable conditions (e.g. good prompt engineering), then I'll agree, although I'm not sure how helpful it is given that you can never know whether what they reply is garbage or not.

But even given "careful prompt engineering" it's still extremely easy to make ChatGPT trip up (I don't have access to GPT-4, so let's talk about ChatGPT for now):

> User: I am a student of machine learning and I have a graduate-level understanding of certain subfields of machine learning and mathematics. I would like to ask you a tricky question about math. Please think carefully and step-by-step. Because, as a language model, each token you generate is a way to perform more computation, please write out your understanding of the problem and your steps of thinking about it. The challenge is as follows: decide whether or not there exists an algorithm that can decide for any statement in propositional logic whether it's true or not

> ChatGPT: The problem you're describing is known as the decision problem for propositional logic, or the "satisfiability problem" (SAT). It asks whether there exists an algorithm that can determine, for any given statement in propositional logic, whether it's true or not. This problem is fundamental in computer science and mathematical logic.

> I will walk you through the key concepts and steps in thinking about this problem:

> Propositional Logic: In propositional logic, statements are composed of variables and logical connectives (AND, OR, NOT, etc.), and the goal is to determine whether a given combination of truth values for these variables makes the statement true (satisfiable) or false (unsatisfiable).

> Formal Statement of SAT: The SAT problem can be formally stated as follows: Given a propositional formula (a statement in propositional logic), can we design an algorithm that correctly decides whether there exists an assignment of truth values to the variables that makes the formula true?

> Assume There Exists Such an Algorithm: Let's assume, for the sake of argument, that there exists an algorithm SAT that can decide the satisfiability problem for any propositional formula.

> Constructing a Contradiction: Now, we'll use a technique similar to the one used in the proof of the Halting Problem to create a contradiction. We define a new formula, Q, that represents a situation where SAT contradicts itself. We construct Q as follows:

> Q(P):
>     if SAT(P) == True:
>         NOT(P) # Negation of the input formula P
>     else:
>         P

> In other words, if SAT says that P is satisfiable, Q takes the negation of P, making it unsatisfiable, and if SAT says that P is unsatisfiable, Q takes P itself.

> Consider Q(Q): Now, let's consider what happens when we apply Q to itself, i.e., Q(Q).

> If SAT(Q) == True, then Q(Q) should be unsatisfiable because it takes the negation of itself.

> If SAT(Q) == False, then Q(Q) should be satisfiable because it takes itself.

> Contradiction: We've reached a contradiction. If SAT(Q) == True, then Q(Q) should be unsatisfiable, but our definition of Q says it should be satisfiable. If SAT(Q) == False, then Q(Q) should be satisfiable, but our definition of Q says it should be unsatisfiable. This contradiction arises because SAT cannot consistently determine the satisfiability of Q(Q).

> Conclusion: As in the Halting Problem, we've demonstrated that there are cases where the algorithm SAT fails to give a correct answer. Therefore, we can conclude that SAT (or any similar algorithm) cannot exist to solve the satisfiability problem for all cases.

> This proof shows that there is no general algorithm that can decide the satisfiability of any propositional formula. The SAT problem is known to be NP-complete, which means it's one of the most challenging problems in computational complexity theory. While efficient algorithms exist for many instances, there's no guarantee of a general solution for all possible cases. This has significant implications for problems in complexity theory and computer science, including issues related to algorithmic efficiency and the limits of computation.

This is absolute garbage. Even worse than garbage, it adds the additional insult of being plausible-sounding garbage to people who've had maybe a cursory introduction to theoretical CS. And notice that it's not even that hard to come up with such an example: a question like this could plausibly be on a problem set for an intro theoretical CS course (if it's not too easy for even that).

My main point is that this betrays an absolute incapability of ChatGPT to perform basic deductive reasoning reliably. It clearly knows that the referenced statement is about SAT, and, if you prompt it directly, it also knows that SAT is decidable, but it somehow isn't able to connect these two facts.
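To make the decidability point concrete: a propositional formula has finitely many variables, so an algorithm can simply try all 2^n truth assignments. The following is a rough sketch, not a practical SAT solver. Representing a formula as a Python function over an assignment dict, and the names `is_satisfiable` and `is_tautology`, are assumptions made purely for illustration.

```python
from itertools import product
from typing import Callable, Dict, Sequence

Assignment = Dict[str, bool]
Formula = Callable[[Assignment], bool]  # hypothetical representation of a formula


def is_satisfiable(formula: Formula, variables: Sequence[str]) -> bool:
    """Decide satisfiability by checking every row of the truth table."""
    for values in product([False, True], repeat=len(variables)):
        if formula(dict(zip(variables, values))):
            return True
    return False


def is_tautology(formula: Formula, variables: Sequence[str]) -> bool:
    """A formula is true under every assignment iff its negation is unsatisfiable."""
    return not is_satisfiable(lambda a: not formula(a), variables)


# (p OR q) AND NOT p: satisfiable (p=False, q=True) but not a tautology
f = lambda a: (a["p"] or a["q"]) and not a["p"]
# p AND NOT p: unsatisfiable
g = lambda a: a["p"] and not a["p"]
# p OR NOT p: a tautology
h = lambda a: a["p"] or not a["p"]

print(is_satisfiable(f, ["p", "q"]), is_satisfiable(g, ["p"]), is_tautology(h, ["p"]))
```

The loop always terminates after at most 2^n iterations, so the problem is decidable. NP-completeness is a statement about how expensive that can get, not about whether an algorithm exists, and the self-referential Q(Q) construction in the quoted answer has no counterpart here because a propositional formula cannot call a decision procedure on itself.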

BTW my objection was related to mathematics, not coding. It's possible that, in the average case, LLMs perform much better at coding, since the level of rigour required in many cases is less than in mathematics. But when it comes to pure deductive reasoning, I wouldn't trust LLMs. Or at least not ChatGPT.