| GPT4 API and ChatGPT both get it wrong:
"Sally has 2 sisters. Each of her brothers has 2 sisters because Sally and her 2 sisters make up the group of siblings each brother has." GPT4 w/ CoT system prompting gets it right: SYS: "You are a helpful assistant. Think through your work step by step before providing your answer." USER: "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" ASSIST: "Let's break down the information given: 1. Sally has 3 brothers.
2. Each brother has 2 sisters. Given that Sally is a girl, she is one of the sisters. Since each brother has 2 sisters and we know Sally is one of them, there must be one more sister to make the statement true for each brother. Therefore, Sally has 1 sister." The importance of prompting makes it quite difficult to compare model peak performance. Especially since different models have different styles of prompts that generate peak performance. |
"Sally has 3 brothers, and each of them has 2 sisters. The description implies that Sally's brothers are her only siblings. Therefore, the two sisters each brother has must be Sally and one other sister. This means Sally has just one sister."