| Marcus might be biased but I don't think you're giving a good refutation, because the fact that GPT-3 gets a lot of things right probabilistically doesn't compensate for the fact that it's not actually understanding what's going on at a semantic level. It's a little bit like some sort of Chinese room, or asking a non-developer to answer your programming questions by looking for something that vaguely resembles your prompt and then picking the most upvoted answer on Stack Overflow. Do they maybe give reasonable answers seven out of ten times, or close enough on a good day? Yeah. Can they program or even understand the question? No. And this is Marcus's point, which is fundamentally correct. It's really beside the point to point to successes; it's the long tail of failures that shows where the problem is. You can argue for a long time about the setup of some of these questions, but just to pick maybe the simplest one from the article: "Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?" GPT-3: "I have a lot of clothes" Someone who actually understands what's going on doesn't produce output like this. Never, because reasoning here is not probabilistic. It's not about word tokens or continuations but about understanding the objects that the words represent and their relationships in the world at a deep, principled level. Which GPT-3 does not do. The fact that some good answers create that appearance does not change that fact. |
Except this isn't how it works. We know it can't be, because GPT-3 can do simple math, despite math being vastly harder with GPT-3's byte pair encoding (it doesn't see numbers in base-N, but in some awful variable-length compressed format). These dismissals don't hold up against the evidence.
> GPT-3: "I have a lot of clothes"
Most people don't write “Yesterday I dropped my clothes off at the dry cleaner’s and I have yet to pick them up. Where are my clothes?” as a way to quiz themselves in the middle of a paragraph. “At the dry cleaner's.” might be the answer you want, but the prompt is a pretty contrived way of writing.
GPT-3 isn't answering your question, it's continuing your story. If you want it to give straight answers, rather than build a narrative, prompt it with a Q&A format and ask it explicitly.
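To make the Q&A suggestion concrete, here is a minimal sketch of how such a prompt might be laid out. The helper name `build_qa_prompt` and the worked example pair are hypothetical, not taken from the article, and the code only builds the text; it makes no API call.

```python
def build_qa_prompt(question, examples):
    """Format worked Q/A pairs, then the new question, ending on an open 'A:'."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
        lines.append("")  # blank line between pairs
    lines.append(f"Q: {question}")
    lines.append("A:")  # the model is expected to continue from here
    return "\n".join(lines)


# Hypothetical worked example that sets up the question-answer pattern.
examples = [
    ("I left my keys on the kitchen table this morning. Where are my keys?",
     "On the kitchen table."),
]

prompt = build_qa_prompt(
    "Yesterday I dropped my clothes off at the dry cleaner's and I have yet "
    "to pick them up. Where are my clothes?",
    examples,
)
print(prompt)
```

Because the text ends on an open `A:`, the most natural continuation is a direct answer rather than more narrative.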
Further, GPT-3's answers are literally sampled at random, because of the high temperature and the lack of best-of selection. You cannot pick one answer out of a large number N of samples to demonstrate that its assigned probabilities are bad, because that cherry-picking naturally searches out GPT-3's least favourable generations.
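A rough sketch of why this matters, using made-up scores for two candidate continuations (the numbers are assumptions for illustration, not GPT-3's actual probabilities). It shows how temperature changes the per-sample chance of the bad answer, and how likely that answer is to show up at least once if you draw many samples and keep the worst.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def prob_at_least_one(p, n):
    """Chance that an outcome with per-sample probability p appears in n independent samples."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical scores: the model clearly prefers the sensible answer.
candidates = ["At the dry cleaner's.", "I have a lot of clothes."]
logits = [4.0, 1.0]

for temperature in (0.2, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    p_bad = probs[1]
    print(
        f"T={temperature}: P(bad) per sample = {p_bad:.6f}, "
        f"P(bad appears in 50 samples) = {prob_at_least_one(p_bad, 50):.4f}"
    )
```

Under these assumed scores the bad continuation is unlikely on any single draw, yet at temperature 1.0 it becomes very likely to appear somewhere among 50 draws, which is exactly what selecting the worst of N exploits.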