by moonshotideas 1370 days ago
Out of curiosity, how did you solve the problem of extracting text from the PDF error-free? Or did you use another package?

Looking at the list of dependencies, it seems like they use poppler-cpp to render the PDFs.

https://gitlab.com/pdfgrep/pdfgrep#dependencies

Poppler's pdftotext -layout is great.
Curious as well. About a year ago I was implementing what I naively thought would be a simple check: verifying that a specific string exists (case-sensitive or case-insensitive) in a PDF's text. I hit many cases where the text was clearly rendered in the document but many libraries couldn't find it. My understanding, after going down the rabbit hole, is that there's a lot of variance in how a PDF represents what looks like a simple string (not too surprising, since I don't like to assume things are simple). I couldn't find anything at the time that seemed to be error-free.

Aside from rendering the documents and applying OCR/text recognition, I ended up living with some error rate. I think pdfgrep was one of the tools I tested. Some other people just used libraries/tools as is, with no QA at all, but in my sample of several hundred verified documents, pdfgrep (and others) missed some.
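
For anyone trying the same kind of string check, here is a rough sketch in Python. It shells out to Poppler's pdftotext with -layout and normalizes the text before searching. The function names are made up for illustration, and the normalization steps (folding ligatures, dropping soft hyphens, rejoining words split across lines, collapsing whitespace) are my guesses at common causes of missed matches, not something confirmed in this thread. It will not help with scanned pages or fonts that lack a usable text mapping.

```python
import re
import subprocess
import unicodedata


def extract_text(pdf_path: str) -> str:
    """Run Poppler's pdftotext with -layout; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def normalize(text: str, case_sensitive: bool = False) -> str:
    """Hypothetical helper: smooth over extraction quirks before matching."""
    # NFKC folds compatibility characters such as the "fi" ligature (U+FB01)
    # into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Drop soft hyphens, which are invisible when rendered.
    text = text.replace("\u00ad", "")
    # Rejoin words hyphenated across a line break ("veri-\nfication").
    # Assumption: this can also merge real compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text if case_sensitive else text.casefold()


def text_contains(text: str, needle: str, case_sensitive: bool = False) -> bool:
    """Return True if needle occurs in text after both are normalized."""
    return normalize(needle, case_sensitive) in normalize(text, case_sensitive)
```

Usage would be something like `text_contains(extract_text("report.pdf"), "some phrase")`, where report.pdf is a placeholder path. Expect this to reduce misses rather than eliminate them.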