by vram22 3221 days ago
Extracting even just the text from a PDF is not always 100% accurate, because of inherent issues with the format: partly because PDF is a graphics format with no one-to-one mapping to text, and perhaps also because of some odd design decisions. I both read about this and was told it by a key person at a PDF software company whose product I researched and then used in a project. The product was xpdf (a C library that also ships as binaries/EXEs), from Glyph & Cog. A client had contracted me to research PDF libraries for text extraction; I found and evaluated a few, recommended xpdf to the client, and used it in the project. That is how I know this.
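
To illustrate the "no one-to-one mapping to text" point, here is a rough Python sketch of what an extractor has to do. It is not how xpdf works internally; the (x, y, text) fragment format, the fixed character width, and the thresholds are all assumptions made up for the example. The idea is only that a PDF places bits of text at coordinates, and the extractor has to guess where lines and spaces go.

```python
from typing import List, Tuple

# Hypothetical fragment: (x, y, text), i.e. a run of glyphs drawn at a position.
Fragment = Tuple[float, float, str]


def naive_extract(
    fragments: List[Fragment],
    line_tol: float = 2.0,    # assumed: max y difference within one line
    space_gap: float = 3.0,   # assumed: horizontal gap that counts as a space
    char_width: float = 5.0,  # assumed: every glyph has the same width
) -> str:
    # Group fragments into lines, top to bottom (PDF y grows upward).
    lines: List[Tuple[float, List[Tuple[float, str]]]] = []
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))

    out = []
    for _, items in lines:
        items.sort()  # left to right within the line
        pieces = []
        prev_end = None
        for x, text in items:
            # Guess a space only when the gap to the previous run is large.
            if prev_end is not None and x - prev_end > space_gap:
                pieces.append(" ")
            pieces.append(text)
            prev_end = x + len(text) * char_width
        out.append("".join(pieces))
    return "\n".join(out)


# One word split into two runs, then a second word: the guess works here.
simple = [(0, 100, "Hel"), (15, 100, "lo"), (30, 100, "world")]
print(naive_extract(simple))

# Two columns side by side: the same heuristic interleaves them.
columns = [
    (0, 100, "Left one"), (200, 100, "Right one"),
    (0, 88, "Left two"), (200, 88, "Right two"),
]
print(naive_extract(columns))
```

The first case comes out fine, but the two-column case is read straight across, so the columns get mixed together line by line. Real extractors use much better heuristics than this, but they are still heuristics, which is why the result is not always 100% right.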

The only guaranteed way to get 100% accurate text from a PDF is ... to not do it :) Instead, get the text from the same source that was used to generate the PDF. Obviously, that will not always be possible, but when it is, it is the better solution.