| HN Mirror

	by freethejazz 720 days ago
	Depending on how much structure you want to extract before passing the PDF contents to the next step in your pipeline, this paper[1] might help surface more options. It's a review/benchmark of numerous tools applied to information extraction from academic documents. I haven't evaluated the solutions they examined myself, but it's how I discovered GROBID, and IMO it lays out the strengths of each approach clearly. [1] https://arxiv.org/pdf/2303.09957