by sadiq 408 days ago
This is good, though it's not clear whether these papers will appear in the PMC Open Access subset (https://pmc.ncbi.nlm.nih.gov/tools/openftlist/) and be bulk downloadable.

I've been doing some work with colleagues at Cambridge and Imperial over the last year on using LLMs to improve evidence synthesis, primarily trying to find papers on the effectiveness of certain Conservation interventions. It's becoming clear that you really need to move beyond screening papers only by title and abstract - there's often information buried deep within papers that can only be found with access to full text. My colleague Anil Madhavapeddy has written a bit about our adventures in trying to ingest full-text academic papers: https://anil.recoil.org/notes/uk-national-data-lib

2 comments

Yes, it depends on what you're doing; for general paper discovery / search tasks, title and abstract can be enough (which is also why Springer and Elsevier have been pulling even their abstracts from sources like OpenAlex).

But for something like that you need full texts to look into results sections. I'm very curious how you're dealing with information contained in tables, or if you're dealing with snippets of text from the full-text alone. Have you poked around Elicit yet?

I've recently had this problem where the important information (number of study participants, and how many were filtered out during which step) was only encoded in figures, not in the text. Maddening.
Do you know of any ready-to-use alternatives to title and abstract screening? Wondering about it since I'm in the weeds of doing so.
What do you mean exactly? I was surprised how easily GROBID converts many papers, at least the arXiv ones, to XML, which is better for processing than PDF.

Most of the papers are constructed from their LaTeX sources, so there's an easy way to undo it, I guess.

https://github.com/kermitt2/grobid
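
For anyone who wants to try this, here is a minimal sketch of what "better processing than PDF" can look like once you have the XML. It assumes you already have a TEI XML string such as the one GROBID's full-text service returns (per its docs, `/api/processFulltextDocument`, on port 8070 by default), and it only reads the title, the abstract, and the top-level body sections. The function name `extract_sections` and the sample document are made up for illustration; the sample is hand-written, not real GROBID output, and real output nests more deeply.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID's XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
P_TAG = "{http://www.tei-c.org/ns/1.0}p"


def _text(el):
    """Flatten an element and its children into one stripped string."""
    return "".join(el.itertext()).strip()


def extract_sections(tei_xml):
    """Hypothetical helper: pull title, abstract and body sections from TEI XML."""
    root = ET.fromstring(tei_xml)

    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = _text(title_el) if title_el is not None else ""

    abstract_el = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    abstract = ""
    if abstract_el is not None:
        abstract = " ".join(_text(p) for p in abstract_el.iter(P_TAG))

    # Map each top-level section heading to its paragraph text.
    sections = {}
    for div in root.findall(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        name = _text(head) if head is not None else "(untitled)"
        sections[name] = " ".join(_text(p) for p in div.findall("tei:p", TEI_NS))

    return {"title": title, "abstract": abstract, "sections": sections}


# Hand-written toy document in the TEI layout assumed above.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title level="a" type="main">A toy study</title></titleStmt></fileDesc>
    <profileDesc><abstract><div><p>We test one intervention.</p></div></abstract></profileDesc>
  </teiHeader>
  <text><body>
    <div><head>Results</head><p>Of 120 participants, 95 remained after screening.</p></div>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    doc = extract_sections(SAMPLE_TEI)
    print(doc["title"])
    print(doc["sections"].get("Results", ""))
```

With sections keyed by heading, you can screen on the Results text instead of only title and abstract. Numbers that live only in figures still won't show up here.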

GROBID is a wonderful resource; Patrice did an awesome job (I used it at my previous job at scite.ai).
that's exactly what I needed!
glad to hear!