by patrick-lewis 2166 days ago

Hi, first author here.

Thanks for the comment, for adding examples, and for your nuanced remarks on the answer overlap split.

My position is that these datasets are still useful for QA, but what was lacking was an analysis of how easy or hard the questions in them are, and what kind of modelling is needed to do well. These overlap phenomena are perhaps less like "bugs" and more like poorly understood features.

We need models that can accurately recall QA pairs they have seen before, so scoring well on "memorizable" QA pairs is still important, but we also want models that can do more than that. A single accuracy number on a leaderboard cannot capture all the behavioural information we need to properly understand the capabilities of these models.
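
As a rough illustration of that last point, here is a minimal sketch of reporting accuracy per overlap category next to the single overall number. The function name, the category labels other than the answer overlap split, and the toy records are all assumptions made for the example, not taken from the paper's code or results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return accuracy per category plus an "overall" entry.

    `records` is an iterable of (category, is_correct) pairs. The
    category labels are hypothetical; use whatever split you have.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    if not totals:
        return {}  # nothing to report on an empty test set
    report = {cat: correct[cat] / totals[cat] for cat in totals}
    report["overall"] = sum(correct.values()) / sum(totals.values())
    return report


# Toy, made-up predictions purely to show the shape of the output.
toy_records = [
    ("question_overlap", True),
    ("question_overlap", True),
    ("answer_overlap_only", True),
    ("answer_overlap_only", False),
    ("no_overlap", False),
    ("no_overlap", False),
]

for name, acc in accuracy_by_category(toy_records).items():
    print(f"{name}: {acc:.2f}")
```

On this toy data the overall score sits in the middle while the per-category scores range from perfect to zero, which is the kind of behaviour a single leaderboard number hides.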