HN Mirror



	by mytddu 3600 days ago
	My gripe with PDF is that it's the standard format for academic publishing, rendering a whole mass of scientific knowledge largely inaccessible for text processing purposes. I've wanted to analyze the Libgen archive of journal articles for a long time but have never found an adequate solution for extracting text from PDFs. Any suggestions on this?

3 comments

IngoBlechschmid 3600 days ago

Sure, the Linux tool "pdftotext" works just fine for this. Two small caveats: ligatures are converted to proper Unicode ligatures rather than their ASCII fallback (as one might want or expect), and complex mathematical formulas are, of course, rendered badly.
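
The ligature caveat can be worked around after extraction. Below is a minimal sketch, assuming pdftotext (from poppler-utils) is installed and on the PATH. The helper names `expand_ligatures` and `pdf_to_text` are made up for illustration, and the mapping only covers the Latin ligatures in the range U+FB00 to U+FB06. It does nothing for badly rendered formulas.

```python
import subprocess

# Latin ligature code points (U+FB00 to U+FB06) and their ASCII fallbacks.
# Illustrative table only; other presentation forms are left untouched.
LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
    0xFB05: "st",  # long s + t
    0xFB06: "st",
}


def expand_ligatures(text: str) -> str:
    """Replace Unicode ligature characters with plain ASCII letters."""
    return text.translate(LIGATURES)


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return the text with ligatures expanded.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return expand_ligatures(text)
```

Calling `pdf_to_text("paper.pdf")` (a placeholder filename) would return the extracted text as a string. An explicit table is used here rather than full Unicode compatibility normalization so that other characters, such as superscripts, are not altered as a side effect.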


mytddu 3600 days ago

I've tried both pdftotext and pdf2txt, and I remember not being satisfied with either. Neither seems to handle non-ASCII characters very well, but I'll take another look soon.


nitrogen 3600 days ago

Pdf2txt wasn't helpful?

http://manpages.ubuntu.com/manpages/precise/man1/pdf2txt.1.h...


based2 3599 days ago

https://pdfbox.apache.org/
