| I also tried this out, along with the balloon example. User: Barbara, Mary, and Harry had red balloons. Ted had a blue balloon. Sam also had a red balloon. How many children have red balloons? ChatGPT (GPT-4): Barbara, Mary, Harry, and Sam each had a red balloon. That means there are 4 children with red balloons. User: What about Ted? ChatGPT (GPT-4): Ted had a blue balloon, not a red one. So, he is not counted among the children with red balloons. Therefore, the total number of children with red balloons remains 4. User: Why do you think I asked my original question? ChatGPT (GPT-4): It seems that you asked your original question to test my ability to understand and process information provided in a text. You wanted to see if I could correctly identify and count the number of children with red balloons based on the details given. User: Can you explain what mistakes a simpler model may have made on this question? ChatGPT (GPT-4): A simpler language model may have made the following mistakes when answering your original question: ... I like trying these sorts of puzzles and "gotcha" prompts, and I've found that GPT-3.5 and earlier versions are indeed not hard to confuse, but it's much harder to find simple examples that confuse GPT-4. Often GPT-4 can even explain the puzzle and why it's interesting. I think this is part of why GPT-4 feels qualitatively different. I understand (roughly) how transformers work, but GPT-4 does seem to have some emergent ability to reason, at least in the sense of making short chains of logical inferences, and this seems to inform its text completions in general, not just on these sorts of puzzles. That said, I do think there's a real danger of overconfidence and overreliance. |