
Hi, one of the paper's authors here. We haven't submitted our model's predictions for the AX-b task yet; we just copied over the predictions from the example submission. We will submit predictions for AX-b in the next few days.

RcouF1uZ4gsC makes a compelling case that the results on this test could be a significant caveat to your results, and to the claim of near-human performance. If so, why make such claims before you have these results? Or at least mention the caveat wherever you make the claim, such as in the abstract.

To be clear, here is the claim we make in the paper (we did not write the title of this post to HN):

> For SuperGLUE, we improved upon the state-of-the-art by a large margin (from an average score of 84.6 [Liu et al., 2019c] to 88.9). SuperGLUE was designed to comprise of tasks that were “beyond the scope of current state-of-the-art systems, but solvable by most college-educated English speakers” [Wang et al., 2019b]. We nearly match the human performance of 89.8 [Wang et al., 2019b]. Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting.

I'm not sure why the SuperGLUE/GLUE benchmark was designed to omit the AX-* scores from the benchmark score. It may be that they have no corresponding training set.

My mistake - I had overlooked that the AX-* scores are expressly omitted from these benchmarks. Perhaps, then, they could provide additional headroom for further research?

Regardless of the status of the AX-* tests, I am very impressed by your results on the SuperGLUE benchmark.

I find it strange that they exclude it. Perhaps the reason is related?