Hacker News
by mytddu 3600 days ago
My gripe with PDF is that it's the standard format for academic publishing, rendering a whole mass of scientific knowledge largely inaccessible for text processing purposes. I've wanted to analyze the Libgen archive of journal articles for a long time but have never found an adequate solution for extracting text from PDFs. Any suggestions on this?
3 comments

Sure, the Linux tool "pdftotext" works just fine for this. Two small caveats: ligatures are converted to proper Unicode ligatures rather than their ASCII fallback (which one might want or expect), and of course complex mathematical formulas are rendered badly.
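For the ligature caveat, here is a rough sketch of one way to post-process the output in Python. It assumes pdftotext (from Poppler or Xpdf) is on your PATH. The helper names `expand_ligatures` and `pdf_to_text` are made up for illustration, and the mapping only covers the Latin ligatures in U+FB00 to U+FB06, so treat it as a starting point rather than a complete fix.

```python
import subprocess

# Latin ligature code points (U+FB00..U+FB06) mapped to ASCII spellings.
# This table is an assumption about what "ASCII fallback" should mean here.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t
    "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace Unicode ligature characters with their plain-letter spelling."""
    return text.translate(_TABLE)


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext on one file and return ligature-free text.

    Passing "-" as the output file makes pdftotext write to stdout;
    "-enc UTF-8" asks for UTF-8 output.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return expand_ligatures(result.stdout.decode("utf-8", errors="replace"))
```

Something like `pdf_to_text("paper.pdf")` (hypothetical file name) would then give text you can tokenize or grep. This does nothing for the math formulas, which stay as badly rendered as before.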
I've tried both pdftotext and pdf2txt, and I remember not being satisfied with either. Neither seems to handle non-ASCII characters very well, but I'll take another look soon.