
Curious as well. About a year ago I was implementing what I naively thought would be a simple check: verify that a specific string exists (case-sensitive or case-insensitive) within a PDF's text. I hit many cases where the text was clearly rendered in the document, but many libraries couldn't find it. My understanding, after going down the rabbit hole, is that there's a lot of variance in how a PDF represents what looks like a simple string on the page. (That wasn't too surprising, since I don't like to assume things are simple.) I couldn't find anything at the time that seemed to be error-free.
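To give a flavor of the kind of mismatch I mean, here's a rough Python sketch of normalizing already-extracted text before searching it. This is illustrative only, not what any particular library does: the function names `normalize` and `contains` are made up for this example, and it only covers a few easy cases (ligature characters, soft hyphens, words split across line breaks, odd whitespace). It does nothing for text drawn as vector outlines or images, or for fonts with missing or wrong Unicode mappings.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Hypothetical helper: smooth over a few common extraction quirks."""
    # NFKC folds compatibility characters, e.g. the ligature U+FB01 becomes
    # "fi" and a no-break space (U+00A0) becomes a plain space.
    text = unicodedata.normalize("NFKC", text)
    # Soft hyphens (U+00AD) are invisible break hints; drop them.
    text = text.replace("\u00ad", "")
    # Rejoin words hyphenated across a line break ("veri-\nfied" -> "verified").
    # Trade-off: this also glues real compounds split at the hyphen.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse any run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def contains(haystack: str, needle: str, case_sensitive: bool = True) -> bool:
    """Hypothetical helper: substring check on normalized text."""
    h, n = normalize(haystack), normalize(needle)
    if not case_sensitive:
        h, n = h.casefold(), n.casefold()
    return n in h


# Made-up sample of what an extractor might hand back.
extracted = "The certi\ufb01cate was veri-\nfied   by\u00a0the Issuing  Authority."
```

With that sample, a plain `"certificate" in extracted` check fails because of the ligature character, while `contains(extracted, "certificate was verified")` matches. Even so, this only narrows the gap; it doesn't close it.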

Aside from rendering the documents and running OCR/text recognition on them, I ended up living with some error rate. I think pdfgrep was one of the tools I tested. Some other people just used libraries/tools as-is with no QA at all, but in my sample of several hundred verified documents, pdfgrep (and others) missed some.