Abstract: Theory of mind (ToM), or the ability to impute unobservable mental states to others, is central to human social interactions, communication, empathy, self-consciousness, and morality. We administer classic false-belief tasks, widely used to test ToM in humans, to several language models, without any examples or pre-training. Our results show that models published before 2022 show virtually no ability to solve ToM tasks. Yet, the January 2022 version of GPT-3 (davinci-002) solved 70% of ToM tasks, a performance comparable with that of seven-year-old children. Moreover, its November 2022 version (davinci-003), solved 93% of ToM tasks, a performance comparable with that of nine-year-old children. These findings suggest that ToM-like ability (thus far considered to be uniquely human) may have spontaneously emerged as a byproduct of language models' improving language skills.
What it suggests to me is that the particular “Theory of Mind” tasks involved actually test the ability to process language and generate appropriate linguistic output, not theory of mind.
It also suggests (with the “thus far considered to be uniquely human”) that the authors are unaware of other theory of mind tests that depend on behavior rather than language. The validity of those tests is controversial, as is also true of the linguistic tests, but on them a number of non-human primates, non-primate mammals, and even some birds (parrots and corvids, particularly) have shown evidence of theory of mind.