by RcouF1uZ4gsC 2428 days ago
I think classifying this as human level is misleading.

Look at the sub-scores on the page. One score that looks very different from humans is AX-b.

The SuperGLUE paper provides more context about AX-b:

https://arxiv.org/pdf/1905.00537.pdf

AX-b "is the broad-coverage diagnostic task, scored using Matthews’ correlation (MCC)."

This is how the paper describes the test:

" Analyzing Linguistic and World Knowledge in Models GLUE includes an expert-constructed, diagnostic dataset that automatically tests models for a broad range of linguistic, commonsense, and world knowledge. Each example in this broad-coverage diagnostic is a sentence pair labeled with a three-way entailment relation (entailment, neutral, or contradiction) and tagged with labels that indicate the phenomena that characterize the relationship between the two sentences. Submissions to the GLUE leaderboard are required to include predictions from the submission’s MultiNLI classifier on the diagnostic dataset, and analyses of the results were shown alongside the main leaderboard. Since this broad-coverage diagnostic task has proved difficult for top models, we retain it in SuperGLUE. However, since MultiNLI is not part of SuperGLUE, we collapse contradiction and neutral into a single not_entailment label, and request that submissions include predictions on the resulting set from the model used for the RTE task. We collect non-expert annotations to estimate human performance, following the same procedure we use for the main benchmark tasks (Section 5.2). We estimate an accuracy of 88% and a Matthew’s correlation coefficient (MCC, the two-class variant of the R3 metric used in GLUE) of 0.77. "

If you look at the scores, humans are estimated to score 0.77. Google T5 scores -0.4 on the test.
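For anyone unfamiliar with MCC, here is a minimal sketch of how the two-class version is computed after collapsing contradiction and neutral into not_entailment, as the quoted passage describes. The function names and the toy labels are my own invention for illustration; they are not taken from the SuperGLUE code or data.

```python
from math import sqrt

def collapse_label(label):
    """Collapse a three-way NLI label into the two-way scheme:
    'contradiction' and 'neutral' both become 'not_entailment'."""
    return "entailment" if label == "entailment" else "not_entailment"

def binary_mcc(gold, pred, positive="entailment"):
    """Matthews correlation coefficient for two classes.
    Returns 0.0 when the denominator is zero (e.g. constant predictions)."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy, made-up labels (not real AX-b data).
gold_three_way = ["entailment", "neutral", "contradiction",
                  "entailment", "neutral", "contradiction"]
gold = [collapse_label(label) for label in gold_three_way]
pred = ["entailment", "not_entailment", "not_entailment",
        "not_entailment", "not_entailment", "entailment"]

score = binary_mcc(gold, pred)
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# A model that always predicts the same label gets MCC 0, whatever its accuracy.
constant = binary_mcc(gold, ["not_entailment"] * len(gold))

print(f"MCC={score:.2f} accuracy={accuracy:.2f} constant-prediction MCC={constant:.2f}")
```

MCC ranges from -1 to 1, and a score near zero means the predictions carry essentially no information about the gold labels, even if raw accuracy looks respectable.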

How did T5 get such a high score if it scored so abysmally on the AX-b test?

The AX scores are not included in the total score.

From the paper: "The Avg column is the overall benchmark score on non-AX∗ tasks."

If the AX scores were included, the gap between humans and machines would be bigger than the current score indicates.

2 comments

Hi, one of the paper's authors here. We haven't submitted our model's predictions for the AX-b task yet; we just copied over the predictions from the example submission. We will submit predictions for AX-b in the next few days.

RcouF1uZ4gsC makes a compelling case that the results on this test could be a significant caveat to the overall results, and to the claim of achieving near-human performance. If so, why would you make such claims before you have these results? At the least, why not mention this caveat wherever you make the claim, such as in the abstract?

To be clear, here is the claim we make in the paper (we did not write the title of this post to HN):

> For SuperGLUE, we improved upon the state-of-the-art by a large margin (from an average score of 84.6 [Liu et al., 2019c] to 88.9). SuperGLUE was designed to comprise of tasks that were “beyond the scope of current state-of-the-art systems, but solvable by most college-educated English speakers” [Wang et al., 2019b]. We nearly match the human performance of 89.8 [Wang et al., 2019b]. Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting.

I'm not sure why the SuperGLUE/GLUE benchmark was designed to omit the AX-* scores from the benchmark score. It may be that they have no corresponding training set.

My mistake - I had overlooked the AX-* scores being expressly omitted from these benchmarks. Maybe it is possible, then, that they could provide the additional headroom for further research?

Regardless of the status of the AX-* tests, I am very impressed by your results on the SuperGLUE benchmark.

I find it strange that they exclude it. Perhaps the reason is related?