As someone working in the field, I congratulate the authors on an excellent accomplishment, but I agree with them that we shouldn't get too excited yet (their quote is below, after the four reasons). Here are some reasons:

1) Most likely, the model is still susceptible to adversarial triggers, as demonstrated on other systems here: http://www.ericswallace.com/triggers

2) T5 was trained on ~750GB of text, or ~150 billion words, which is more than 100 times the number of words native English speakers acquire by the age of 20.

3) Most or all of the tests are multiple-choice. Learning complex correlations from sufficient data should help solve most of them. This is useful, but human-level understanding is more than correlations.

4) Performance on the datasets that require commonsense knowledge, COPA and WSC, is the weakest relative to humans (who score 100.0 on both).

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, p.32

https://arxiv.org/pdf/1910.10683.pdf

"Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting."

I’d like to emphasize that the work and the paper are excellent. Still, we are quite far from human-level language understanding.

---

We may need more advanced tests to probe the actual language understanding ability of AI systems. Here are some ideas:

* Test for conceptual understanding in a non-multiple-choice format. Example: write a summary of a New Yorker article, rather than of standard news pieces (which tend to follow repeated patterns).

* Commonsense tests with longer chains of inference than those needed to solve Winograd Schemas, set in non-standard situations (e.g. a fantasy world). This should greatly reduce the chance that an approach can simply detect correlations from huge datasets.

* Understanding novel, creative metaphors, like those used in some essays by professional writers or in some of the Economist's article titles.

Structuring a problem as a multiple-choice task basically turns it into a classification problem, but that doesn't answer the question everyone wants answered: is it really possible to reduce the problem of language understanding to classification? That is, is it really possible to understand human language with no ability other than the ability to identify the classes of objects?
But that is a question that has to be answered before any performance on benchmarks that reduce language understanding to classification can be appraised correctly. If accurate classification is not sufficient for language understanding, then beating benchmarks like SuperGLUE tells us nothing new (we already know we have good classifiers).
The problem here is that we have no good measures of language understanding, in humans or machines, because we have a poor, er, understanding of our own language ability. Until we know more about what it means to understand language, it won't be possible to evaluate automated language understanding systems very well.
Hopefully, though, the skepticism I've observed around results like the one above will lead to a renewed effort to research our language ability, and perhaps our intelligence in general.