by patrick-lewis
2119 days ago
Hi, first author here. Thanks for the comment, for adding examples, and for your nuanced remarks on the answer-overlap split. My position is that these datasets are still useful for QA; what was lacking was an analysis of how easy or hard their questions are and what kind of modelling is needed to do well. These overlap phenomena are perhaps less like "bugs" and more like poorly understood features. We need models that can accurately recall QA pairs they have seen before, so scoring well on "memorizable" QA pairs still matters, but we also want models that can do more than that.
A single accuracy number on a leaderboard cannot capture all the behavioural information we need to properly understand the capabilities of these models.