| > Try it out yourself, make up some little math word problems and ask chatGPT or something. I have, and chatGPT simply sucks at logical reasoning in the way that is required for mathematics. It's extremely easy to make it trip up by just formulating questions in a slightly nonstandard way (so that it doesn't exactly match what it has been trained on), e.g. > User: prove that there are only finitely many primes > ChatGPT: Actually, it is not possible to prove that there are only finitely many prime numbers because this statement is false. There are infinitely many prime numbers, and this has been proven by the ancient Greek mathematician Euclid over 2,000 years ago. Euclid's proof is based on the principle that if you assume there are only finitely many prime numbers, you can always find a new prime by multiplying all known primes together and adding 1. This new number will either be prime itself or have a prime factor not present in the original list of primes, which contradicts the assumption of finiteness. > User: prove that there exists a finite list of integers that contains all integers which are only divisible by itself and 1 > ChatGPT: You're referring to prime numbers. The statement is essentially the definition of prime numbers. A prime number is a positive integer greater than 1 that is only divisible by 1 and itself. There is no need to prove this statement because it defines what prime numbers are. While there are infinitely many prime numbers, they can be listed, but this list will be finite in the sense that there are infinitely many prime numbers, but they can all be enumerated within an infinite sequence. Some of the first prime numbers are 2, 3, 5, 7, 11, 13, and so on. ChatGPT gets the first question right because something like that is undoubtedly in its training corpus, but it gets utterly confused by the second question (which is the same statement), to the point that its answer belongs to the "not even wrong" category. 
I don't know where this myth comes from that LLMs are magically good at maths. They're not. |
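For reference, the Euclid argument quoted above is easy to check mechanically. Here's a minimal Python sketch; the function names `smallest_prime_factor` and `euclid_witness` are mine, made up for illustration, and it assumes the input list really does contain only primes. Hand it any finite list of primes and it gives back a prime that isn't on the list, which is why no finite list can be complete.

```python
from math import prod


def smallest_prime_factor(n: int) -> int:
    """Return the smallest prime factor of n (n >= 2) using trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    # No divisor up to sqrt(n), so n itself is prime.
    return n


def euclid_witness(primes: list[int]) -> int:
    """Given a finite list of primes, return a prime that is not in the list.

    Hypothetical helper: multiply the list together, add 1, and take any
    prime factor of the result. Every prime in the list leaves remainder 1
    when dividing that number, so the factor cannot be in the list.
    """
    n = prod(primes) + 1
    return smallest_prime_factor(n)


if __name__ == "__main__":
    candidate_list = [2, 3, 5, 7, 11, 13]
    print(euclid_witness(candidate_list))
```

Note that the product plus one doesn't have to be prime itself (for the list above it is 30031 = 59 × 509); the argument only needs one of its prime factors.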
Secondly, quality is highly variable, and there are traps the context window can fall into that cause especially bad results. Jeremy Howard has a great video (https://www.youtube.com/watch?v=jkrNMKz9pWU, where the 'limitations and capabilities' section starts at 18:05 and is only 13 minutes long) talking about how quality depends on how you frame your prompts, model power (4 does a lot of stuff that 3.5 can't), and whether you're in a kind of "context trap" of repeated mistakes.
Of course, some people like to point out that if it's so "finicky" and variable, it is "dumb." Sure, if you like. I'm not interested in whatever definitions you're using for those things. The objective and observable point is that, given well-known prompting practices, LLMs can do something functionally equivalent to reasoning about novel problems, and more powerful ones can reason about more difficult things.
I rephrased your prompt (instead of "prove a false thing," I made it "decide whether this thing can exist, and prove your answer") and added a little well-known boilerplate prompt sugar. It seems to have done a better job.
https://chat.openai.com/share/53214f0c-17f7-4a3d-95be-8fd676...