| I think classifying this as human-level is misleading. Look at the sub-scores on the page: the one that differs most from the human baseline is AX-b. The SuperGLUE paper (https://arxiv.org/pdf/1905.00537.pdf) gives more context. AX-b "is the broad-coverage diagnostic task, scored using
Matthews’ correlation (MCC)." This is how the paper describes the test: "
Analyzing Linguistic and World Knowledge in Models. GLUE includes an expert-constructed,
diagnostic dataset that automatically tests models for a broad range of linguistic, commonsense, and
world knowledge. Each example in this broad-coverage diagnostic is a sentence pair labeled with a three-way entailment relation (entailment, neutral, or contradiction) and tagged with labels that
indicate the phenomena that characterize the relationship between the two sentences. Submissions
to the GLUE leaderboard are required to include predictions from the submission’s MultiNLI
classifier on the diagnostic dataset, and analyses of the results were shown alongside the main
leaderboard. Since this broad-coverage diagnostic task has proved difficult for top models, we retain
it in SuperGLUE. However, since MultiNLI is not part of SuperGLUE, we collapse contradiction
and neutral into a single not_entailment label, and request that submissions include predictions
on the resulting set from the model used for the RTE task. We collect non-expert annotations to
estimate human performance, following the same procedure we use for the main benchmark tasks
(Section 5.2). We estimate an accuracy of 88% and a Matthew’s correlation coefficient (MCC, the
two-class variant of the R_3 metric used in GLUE) of 0.77.
" By this estimate, humans score 0.77 on AX-b, while Google's T5 scores -0.4 on the same test. How did T5 get such a high overall score when it did so poorly on AX-b? Because the AX scores are not included in the total. From the paper: "The Avg column is the overall benchmark score on non-AX∗ tasks." If the AX scores were included, the gap between humans and machines would be bigger than the current score indicates. |
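For anyone unfamiliar with MCC, here is a minimal sketch of the two-class formula, to show why a score near zero means the predictions carry essentially no information about the gold labels, even when accuracy looks respectable. The function names and all of the counts are made up for illustration; none of them come from the paper or from any leaderboard submission.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a two-class confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the matrix is empty,
    # since the coefficient is otherwise undefined (division by zero).
    return numerator / denominator if denominator else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of examples labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical test set: 40 entailment examples, 60 not_entailment examples.
cases = {
    # Labels everything not_entailment: decent accuracy, zero correlation.
    "always not_entailment": dict(tp=0, tn=60, fp=0, fn=40),
    # Guesses independently of the gold label: zero correlation.
    "uninformed guesser": dict(tp=20, tn=30, fp=30, fn=20),
    # Marginally worse than the guesser: correlation dips below zero.
    "slightly worse": dict(tp=19, tn=30, fp=30, fn=21),
    # Every example correct: correlation of one.
    "perfect": dict(tp=40, tn=60, fp=0, fn=0),
}

for name, counts in cases.items():
    print(f"{name}: accuracy={accuracy(**counts):.2f}, mcc={mcc(**counts):.2f}")
```

The point of the comparison is that accuracy rewards a model for exploiting class imbalance, while MCC does not, which is why a slightly negative AX-b score reads as chance-level behavior on the diagnostic set.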