| HN Mirror



	by astrange 1828 days ago
	How well does text extraction from a PDF work? I almost never try it but thought there were random spaces in the output and such things.

2 comments

hangsi 1828 days ago

A fair summary would be "often very well, but not always". A good example is the S2ORC dataset [0], a dataset of full parses of scientific PDFs. In their paper, the authors write about how difficult it was to get the parsers to work reliably, and how having multiple published versions of a PDF helped when the parser failed on the first one.

[0] https://allenai.org/data/s2orc
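
To illustrate the "try another version when the parser fails" idea, here is a minimal Python sketch. It is not the S2ORC pipeline's actual code: `parse_first_working` and the `parse_pdf` callable are hypothetical names, and the sketch assumes the parser either raises an exception or returns empty text when it fails.

```python
from typing import Callable, Iterable, Optional


def parse_first_working(
    versions: Iterable[str],
    parse_pdf: Callable[[str], str],
) -> Optional[str]:
    """Return the text of the first version that parses, or None if all fail.

    `versions` is an ordered collection of paths to alternative copies of the
    same paper; `parse_pdf` is a hypothetical extractor supplied by the caller.
    """
    for path in versions:
        try:
            text = parse_pdf(path)
        except Exception:
            # This version could not be parsed; try the next one.
            continue
        if text.strip():
            # Accept only non-empty output as a successful parse.
            return text
    return None
```

Any real extractor could be passed in as `parse_pdf`; the point is only that a failure on one copy does not have to mean losing the paper.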


moyix 1827 days ago

It's worth noting that for most papers, arXiv provides the LaTeX source for download, which is presumably what they trained on.
