by vram22 3442 days ago

I used the xpdf [1] package, a C library and a set of CLI tools (mentioned by others in this thread too; the pdftotext command-line utility and the xpdf/pdftotext library are part of it), in a consulting project for a client some years ago. (The client had asked me to evaluate some libraries for PDF text extraction and recommend one. I chose xpdf, and then consulted for them on their product, using xpdf for part of the work. I also did some post-processing of the extracted text in Python. Interesting project, overall.)
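To make the "pdftotext plus Python post-processing" combination concrete, here is a minimal sketch, not the code from that project. It assumes pdftotext is on the PATH; the function names extract_text and clean_pages are made up for illustration, and the cleanup steps (splitting pages on the form feed that pdftotext writes between pages, rejoining words hyphenated across a line break, trimming trailing whitespace) are just examples of typical post-processing.

```python
import re
import subprocess
from typing import List


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on pdf_path and return the text it writes to stdout.

    "-layout" asks pdftotext to keep the physical page layout;
    "-" as the output file name sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def clean_pages(raw: str) -> List[str]:
    """Split raw pdftotext output into pages and tidy each one.

    Illustrative cleanup only (an assumption, not a general recipe).
    """
    pages = []
    # pdftotext separates pages with a form feed character.
    for page in raw.split("\f"):
        # Rejoin words that were hyphenated across a line break.
        page = re.sub(r"(\w)-\n(\w)", r"\1\2", page)
        # Drop trailing whitespace on every line.
        lines = [line.rstrip() for line in page.splitlines()]
        text = "\n".join(lines).strip()
        if text:  # skip empty pages, e.g. after the final form feed
            pages.append(text)
    return pages
```

Usage would be something like clean_pages(extract_text("report.pdf")), where report.pdf is a placeholder file name. Note that the de-hyphenation rule will also merge genuinely hyphenated words that happen to fall at a line end, which is a small instance of the extraction being inherently lossy.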

As part of this work, I corresponded for a while with one of the key technical people at Glyph & Cog, the company behind xpdf. I learned from him about some of the issues with text extraction from PDF, a key point being that in some or many cases the extraction can be imperfect or incomplete, due to factors inherent in the PDF format itself and its differences from plain text. PDFTextStream (for Java) is another one I had heard of, from someone I know personally, who said it was quite good. But those inherent issues of text extraction still exist.

So wherever possible, a good option is to go to the source from which the PDF was originally generated, instead of trying to reverse-engineer it, and get the text you want from there. Not always possible, of course, but a preferred approach, particularly for cases where maximum accuracy of text extraction is desired.

[1] Not to be confused with xtopdf, my PDF toolkit for PDF generation from other formats.