by astrange 1828 days ago
How well does text extraction from a PDF work? I almost never try it, but I thought the output had random spaces and similar problems.
2 comments

A fair summary would be "often very well, but not always". A good example is the S2ORC dataset [0], a dataset of full parses of scientific PDFs. In their paper, the authors write about the difficulties of getting the parsers to work reliably, and how having multiple published versions of a PDF helped when the parser failed on the first one.

[0] https://allenai.org/data/s2orc
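
To make the fallback idea concrete, here is a rough sketch. It is not the S2ORC authors' code: `parse_first_working` and the `parse` callable are hypothetical names, and `parse` stands in for whatever PDF-to-text tool you actually use. It tries each available version in order and keeps the first one that yields non-empty text.

```python
from typing import Callable, Iterable, Optional


def parse_first_working(
    versions: Iterable[str],
    parse: Callable[[str], str],
) -> Optional[str]:
    """Return the text of the first version that parses, or None.

    `versions` is a list of paths to alternative PDFs of the same paper.
    `parse` is a placeholder for a real extractor (assumption: it takes
    a path and returns the extracted text, raising on failure).
    """
    for path in versions:
        try:
            text = parse(path)
        except Exception:
            # This version defeated the parser; fall through to the next one.
            continue
        if text.strip():
            # Treat empty or whitespace-only output as a failed parse too.
            return text
    return None
```

Catching every exception is deliberately blunt here; in practice you would probably log which version failed and why.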

It's worth noting that for most papers, arXiv provides the LaTeX source for download, which is presumably what they trained on.