by tyingq 3399 days ago
Curious if this works better than the pdftotext utility that comes in the Debian poppler-utils package.

That has a -layout option that works really well sometimes and really terribly other times. It doesn't seem to be related to document complexity either.
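
For anyone who wants to compare the two modes side by side, here is a minimal Python sketch that shells out to pdftotext with and without the single-dash `-layout` flag. It assumes a `pdftotext` binary is on the PATH; the helper names `build_pdftotext_cmd` and `extract_text` and the file name `report.pdf` are made up for illustration.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, layout=True):
    """Build the argument list for pdftotext, sending the text to stdout.

    Hypothetical helper: "-" as the output file means stdout.
    """
    cmd = ["pdftotext"]
    if layout:
        # Ask pdftotext to preserve the physical layout of the page.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, layout),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


if __name__ == "__main__":
    # "report.pdf" is a placeholder; point this at a real file.
    print(build_pdftotext_cmd("report.pdf"))
```

Calling `extract_text(path, layout=True)` and `extract_text(path, layout=False)` on the same file and diffing the results is a quick way to see which mode suits a given document.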


I used the xpdf [1] package, a C library and a set of CLI tools (mentioned by others in this thread too; the pdftotext command-line utility and the xpdf/pdftotext library are part of it), in a consulting project for a client some years ago. (The client had asked me to evaluate some libraries for PDF text extraction and recommend one, which I did (I chose xpdf), and I then consulted for them on their product, using xpdf for part of the work. I also did some post-processing of the extracted text in Python. Interesting project, overall.)

As part of this work, I corresponded for a while with one of the key technical people at Glyph and Cog, the company behind xpdf. I learned from him about some of the issues with text extraction from PDF, one of the key points being that in some or many cases the extraction can be imperfect or incomplete, due to factors inherent in the PDF format itself and its differences from plain text. PDFTextStream (for Java) is another library I had heard of, from someone I know personally, who said it was quite good. But those inherent issues of text extraction do exist.

So wherever possible, a good option is to go to the source from which the PDF was originally generated, instead of trying to reverse-engineer it, and get the text you want from there. Not always possible, of course, but a preferred approach, particularly for cases where maximum accuracy of text extraction is desired.

[1] Not to be confused with xtopdf, my PDF toolkit for PDF generation from other formats.

During development I compared my results with those of the pdftotext utility and I obtained more or less similar results. The objective of my code was to have an equivalent tool that is easily embeddable in any Java/Android project, and to learn more about Apache PDFBox.
I imagine it's not an easy task guessing about proportionally spaced fonts, overlapping bounding boxes, columns, tables, wrapping, and so forth.
Yes, definitely not easy, but fortunately PDFBox offers a solid base to start with.
It probably works reasonably well with the documents it has been tested with. It's a very hard problem to crack if you ask me. (edit: word choice)
Also available for Windows and Mac at http://www.foolabs.com/xpdf/download.html
Last year, my boss gave me a task that looked simple enough at first glance - get data on how many vacation days each employee has in total, how many they have used in the current year, and how many they have left, and put that data in our SharePoint server (so people can see when filling out a vacation request if they actually have enough days left).

Most of that was fairly easy, except that the POS program that sits on the actual data only allows exporting it in one single format: PDF. Converting that PDF file to a CSV that I could feed into SharePoint was one of the nastiest things I did last year. I did manage to get it to work though, by toying around with pdftotext for a while and exploring its command line parameters.

It was a pleasure to use! It took me a while to discover the correct set of command line parameters I needed, but I got it to work! Thanks, xpdf!
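
A rough Python sketch of the PDF-text-to-CSV step, for anyone facing something similar. This is not the exact approach described above. It assumes the text came out of pdftotext with the layout preserved, so that columns are separated by runs of two or more spaces. The column names, the sample row, and the helper names are invented for illustration.

```python
import csv
import io
import re


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells.

    Assumption: columns are separated by two or more spaces, so a single
    space inside a cell (e.g. a full name) does not split the cell.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between table rows
        rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialize rows to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up sample resembling a vacation-days report.
    sample = (
        "Name           Total  Used  Left\n"
        "\n"
        "Alice Smith       25    10    15\n"
    )
    print(rows_to_csv(layout_text_to_rows(sample)))
```

Real reports tend to need extra handling for page headers, wrapped cells, and empty columns, which a whitespace split like this will not catch.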

Had several somewhat similar experiences in my career. I think the general public would be surprised at the amount of duct tape and chewing gum that's behind things that appear to be important processes.
pdftotext from xpdf (http://www.foolabs.com/xpdf/download.html) also has the -table option which usually works better than -layout. Unfortunately the poppler-utils fork doesn't have this option.