by tyingq
3399 days ago
Curious if this works better than the pdftotext utility that comes in the Debian poppler-utils package. That has a -layout option that works really well sometimes and really terribly other times. It doesn't seem to be related to document complexity either.
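For anyone who wants to check this on their own documents, here is a minimal sketch that runs pdftotext with and without -layout and diffs the two results. It assumes pdftotext from poppler-utils is on the PATH; the helper names (build_pdftotext_cmd, extract_text, diff_lines) are made up for illustration and are not part of any tool mentioned in this thread.

```python
import difflib
import subprocess
import sys


def build_pdftotext_cmd(pdf_path, layout):
    """Build the pdftotext argument list; the trailing '-' sends text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # ask pdftotext to keep the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout):
    """Run pdftotext (assumed to be installed) and return the extracted text."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, layout),
        capture_output=True,
        text=True,
        check=True,  # raise if pdftotext exits with an error
    )
    return result.stdout


def diff_lines(plain, with_layout):
    """Return unified-diff lines between the default and -layout extractions."""
    return list(
        difflib.unified_diff(
            plain.splitlines(),
            with_layout.splitlines(),
            fromfile="default",
            tofile="-layout",
            lineterm="",
        )
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    pdf = sys.argv[1]
    for line in diff_lines(extract_text(pdf, False), extract_text(pdf, True)):
        print(line)
```

Running it over a handful of PDFs where -layout does well and a handful where it does badly makes it easier to see what the bad cases have in common.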
As part of this work, I corresponded for a while with one of the key technical people at Glyph and Cog, the company behind xpdf. I learned from him about some of the issues with text extraction from PDF. One of the key points is that in some or many cases the extraction can be imperfect or incomplete, due to factors inherent in the PDF format itself and its differences from plain text. PDFTextStream (for Java) is another tool I had heard of, from someone I know personally, who said it was quite good. But those inherent issues of text extraction do exist.
So wherever possible, a good option is to go to the source from which the PDF was originally generated, instead of trying to reverse-engineer it, and get the text you want from there. Not always possible, of course, but a preferred approach, particularly for cases where maximum accuracy of text extraction is desired.
[1] Not to be confused with xtopdf, my PDF toolkit for PDF generation from other formats.