|
|
|
|
|
by therobot24
582 days ago
|
|
i only performed a quick read of the paper but couldn't find how many humans they used to generate their expected human performance, this seems to be the main content: > To ensure that we did not overfit PaperQA2 to achieve high performance on LitQA2, we generated a new set of 101 LitQA2 questions after making most of the engineering changes to PaperQA2. The accuracy of PaperQA2 on the original set of 147 questions did not differ significantly from its accuracy on the latter set of 101 questions, indicating that our optimizations in the first stage generalized well to new and unseen LitQA2 questions (Table 2). > To compare PaperQA2 performance to human performance on the same task, human annotators who either possessed a PhD in biology or a related science, or who were enrolled in a PhD program (see Section 8.2.1), were each provided a subset of LitQA2 questions and a performance-related financial incentive of $3-12 per question to answer as many questions correctly as possible within approximately one week, using any online tools and paper access provided by their institutions. Under these conditions, human annotators achieved 73.8% ± 9.6% (mean ± SD, n = 9) precision on LitQA2 and 67.7% ± 11.9% (mean ± SD, n = 9) accuracy (Figure 2A, green line). PaperQA2 thus achieved superhuman precision on this task (t(8.6) = 3.49, p = 0.0036) and did not differ significantly from humans in accuracy (t(8.5) = −0.42, p = 0.66). |
|