by aglionby 2292 days ago
I spent some time extracting abstracts from NLP papers (ACL conferences) and it was mostly straightforward. Converting PDF -> XML with pdfquery gave each character as its own element, and the characters were mostly ordered sensibly and grouped into paragraphs.

However... this broke down in some cases, mainly with formatted text but sometimes with PDFs that looked like they had been compiled in some nonstandard way. As a result I ended up chucking the XML structure entirely and rebuilding the text from character-level coordinates. Even then, formatted text was an issue, because its characters had slightly offset y coordinates from regular characters on the same line.
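To make the coordinate-based approach concrete, here is a minimal sketch of the general idea. It is not the code from the gist. It assumes each character has already been pulled out as a hypothetical (char, x, y) tuple, and the names rebuild_text, y_tol and gap are made up for illustration. Characters whose y values fall within a tolerance are treated as one line, which is one way to absorb the small offsets that formatted text introduces.

```python
from typing import List, Tuple

# Hypothetical input shape: (character, x, y), with y growing upward as in PDF space.
Char = Tuple[str, float, float]


def rebuild_text(chars: List[Char], y_tol: float = 2.0, gap: float = 3.0) -> str:
    """Group characters into lines by y proximity, then order each line by x."""
    lines: List[dict] = []  # each entry: {"y": reference y, "chars": [(x, ch), ...]}
    # Sort by descending y so lines come out top to bottom.
    for ch, x, y in sorted(chars, key=lambda c: -c[2]):
        # A slightly offset glyph joins the current line if it is within y_tol
        # of that line's reference y (the first, highest character seen).
        if lines and abs(lines[-1]["y"] - y) <= y_tol:
            lines[-1]["chars"].append((x, ch))
        else:
            lines.append({"y": y, "chars": [(x, ch)]})

    out = []
    for line in lines:
        text, prev_x = "", None
        for x, ch in sorted(line["chars"]):
            # Insert a space when the distance between left edges exceeds gap.
            if prev_x is not None and x - prev_x > gap:
                text += " "
            text += ch
            prev_x = x
        out.append(text)
    return "\n".join(out)


# "x" sits 0.8 units above its neighbours, as formatted text might.
sample = [("H", 0.0, 100.0), ("i", 1.0, 100.0), ("x", 3.0, 100.8),
          ("o", 0.0, 90.0), ("k", 1.0, 90.0)]
print(rebuild_text(sample, y_tol=2.0, gap=1.5))
```

Both thresholds are guesses that would need tuning per document: too small a y_tol splits formatted characters onto their own line, and too large a one merges adjacent lines.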

I'm not sure I could take this experience and say that extracting _all text_ would be straightforward. Hopefully for most documents the XML is nicely structured, but I imagine there are many more opportunities for inconsistencies in how the PDF is generated when thinking about diagrams, tables etc. rather than just abstracts.

Considered writing up a blog post about my experiences with the above but imagined that it was far too niche. Code's here [1] if it's of interest.

[1] https://gist.github.com/GuyAglionby/4b55d00803710f2e2e9877fd...