Hacker News
by bgia 3212 days ago
Extracting data from PDF in a reliable way.
4 comments

Extracting even just the text from PDFs is not always 100% accurate, because of inherent issues with the PDF format: it is partly a graphics format with no one-to-one mapping to text, and possibly also because of some odd design decisions. I both read about this and was told about it by a key person at a PDF software company whose product I researched and then used in a project. The product was xpdf (a C library that also ships as command-line binaries), from Glyph & Cog. A client had contracted me to research PDF libraries for text extraction; I found and evaluated a few, recommended xpdf to the client, and used it in the project. That is how I know this.
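For anyone who wants to try the xpdf route, here is a minimal sketch of calling its pdftotext binary from Python. It assumes pdftotext is installed and on your PATH; the function names build_pdftotext_cmd and extract_text are made up for illustration and are not part of xpdf. As noted above, the output is not guaranteed to match the original text exactly.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, keep_layout=True):
    """Build the argument list for the pdftotext command-line tool.

    Hypothetical helper: only the flags (-layout, -enc) and the "-"
    output target come from pdftotext itself.
    """
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout,
        # which often helps with columns and tables.
        cmd.append("-layout")
    # "-" as the output file sends the extracted text to stdout.
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Replace undecodable bytes instead of failing on odd encodings.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") would return whatever pdftotext manages to recover; it is worth spot-checking the result against the rendered pages, since reading order and spacing can differ.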

The only guaranteed way to get 100% accurate text from PDF is ... to not do it :) Instead, get the text from the same source that is used to generate the PDF. Obviously, that will not always be possible, but when it is, it is the better solution.

We do a decent amount of work in this area, with lots of web scraping and extraction. Unfortunately, we don't do it against PDFs that have a strict format, or even a loose one; I wish they were government forms, or forms of any kind. Would you say the PDFs you are looking at have some kind of format, and the readers are just hit or miss?
I am a freelancer doing exactly this. Drop me a line if interested.
Would this help <https://www.pdfdata.io/>?