by mannykannot
2428 days ago
RcouF1uZ4gsC makes a compelling case that the scores on this test could be a significant caveat to the reported results, and to the claim of near-human performance. If so, why make such claims before you have these results? Or at least mention the caveat wherever the claim is made, such as in the abstract.

> For SuperGLUE, we improved upon the state-of-the-art by a large margin (from an average score of 84.6 [Liu et al., 2019c] to 88.9). SuperGLUE was designed to comprise of tasks that were “beyond the scope of current state-of-the-art systems, but solvable by most college-educated English speakers” [Wang et al., 2019b]. We nearly match the human performance of 89.8 [Wang et al., 2019b]. Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting.
I'm not sure why the SuperGLUE and GLUE benchmarks were designed to omit the AX-* scores from the overall benchmark score. It may be because those tasks have no corresponding training set.