|
Do you have ideas for what would make a better experiment? The methodology for the literature search comparison, while simple, is the best I could come up with. We developed ~250 multiple-choice questions that each require a deep dive into a paper to answer, ideally with very convincing distractor answers. Then we gave 9 evaluators (post-docs and grad students in biology) a week to answer 40 questions each, without any limitations on their search. The evaluators were paid a base rate per question completed, with a 50-100% bonus if they got enough questions correct. Under those circumstances, the evaluators had an answer precision of 73.8%, and the AI system (PaperQA2) had 85.2%. Both the evaluators and PaperQA2 could decline to answer a particular question. If you look at accuracy, which counts unanswered questions against the score, evaluators were at 67.7% and PaperQA2 was at 66%. So in terms of overall accuracy -- humans still did a touch better. But when actually answering, the AI was more precise.

For the literature synthesis comparison, I think the methodology was pretty solid too, but I would love more feedback. We had PaperQA2 write cited articles for ~19k human genes, of which ~3.9k have (non-stub) Wikipedia articles. It's worth noting that this is a particularly technical subset of Wikipedia articles. We sampled 300 articles that were in both sources, then extracted 500 statements from each source (a statement is basically a paragraph block, so it could be compound or even multi-sentence). The statements were shuffled and obfuscated so that the origin could not be determined from the statement alone. They were then given to a team of 4 evaluators, who were each asked to evaluate whether the information was correct as cited, i.e. did the source actually support the statement. So they had to access (if they could) and actually read all the sources.
After we got the evaluator gradings back, we could compile and map each statement back to its origin for comparison. Under these circumstances, the PaperQA2-written articles were 83% cited and supported, while the Wikipedia articles were 61.5% cited and supported. Wikipedia had comparatively more uncited claims, so if we eliminate those and focus only on the cited claims, PaperQA2 had 86.1% of claims supported by the source and Wikipedia had 71.2%. We analyzed every single unsupported claim, and on Wikipedia, claims are often attributed to arbitrary or really broad sources, like a landing page for a database. (Here's the paper, fwiw: https://arxiv.org/abs/2409.13740) |
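
For anyone who wants to sanity-check how the paired numbers above relate, here is a rough sketch. It assumes precision = correct / answered and accuracy = correct / all questions, and likewise that "cited and supported" = (fraction cited) x (fraction of cited claims that are supported). Those definitions are an assumption for illustration, so check the paper for the exact ones. The helper name `implied_rate` is made up, and the derived rates are back-of-the-envelope values, not figures reported in the paper.

```python
def implied_rate(overall: float, conditional: float) -> float:
    """Return the implied attempt/citation rate.

    Assumes overall = rate * conditional, e.g.
    accuracy = answer_rate * precision (an assumed definition).
    """
    if not 0.0 < conditional <= 1.0:
        raise ValueError("conditional must be in (0, 1]")
    if not 0.0 <= overall <= conditional:
        raise ValueError("overall must be between 0 and conditional")
    return overall / conditional


# (overall, conditional) pairs quoted in the comment above.
reported = {
    "search: evaluators": (0.677, 0.738),   # accuracy, precision
    "search: PaperQA2": (0.660, 0.852),     # accuracy, precision
    "synthesis: PaperQA2": (0.830, 0.861),  # cited+supported, supported given cited
    "synthesis: Wikipedia": (0.615, 0.712), # cited+supported, supported given cited
}

# Derived under the assumption above; not reported in the source.
implied = {name: implied_rate(o, c) for name, (o, c) in reported.items()}

for name, rate in implied.items():
    print(f"{name}: implied rate ~ {rate:.1%}")
```

Under those assumptions, the evaluators answered roughly 92% of their questions while PaperQA2 answered roughly 77%, which is why higher precision did not translate into higher accuracy. The same arithmetic suggests about 96% of PaperQA2 statements and about 86% of Wikipedia statements carried a citation at all.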