Hacker News new | ask | show | jobs
by freethejazz 673 days ago
Depending on how much structure you want to extract before passing the pdf contents to the next step in your pipeline, this paper[1] might be helpful in surfacing more options. It's a review/benchmark of numerous tools applied to the information extraction of academic documents. I haven't been through to evaluate the solutions they examined, but it's how I discovered GROBID and IMO lays out the strengths of each approach clearly.

[1] https://arxiv.org/pdf/2303.09957