by handsomechad 2270 days ago
Do you have a link to this solution?
1 comment

ImageMagick and Tesseract for OCR-ing each page of a PDF into a separate text file (going through TIFF as the intermediate image format; the huge TIFFs can be discarded afterwards), private git repos for hosting, then ag/grep for searching. It's not as easy to locate a phrase back in the PDF as with e.g. Google Books, but Google Books, with its copyright-related content restrictions, is useless most of the time.
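
No link handy, but a rough sketch of the OCR step might look like the Python below. It is not the exact script, just one way to wire it up. Assumptions: ImageMagick is on the PATH as `convert` (ImageMagick 7 usually calls it `magick`), `tesseract` is installed, and you already know the page count. The function names, the 300 DPI default, and the `name-0001.txt` naming scheme are made up for illustration.

```python
import subprocess
from pathlib import Path


def convert_cmd(pdf: Path, page: int, tiff: Path, dpi: int = 300) -> list[str]:
    """Build the ImageMagick command that renders one PDF page to a TIFF.

    ImageMagick picks a single page with a zero-based "[n]" suffix on the
    input file name. The 300 DPI default is an assumption, not a requirement.
    """
    return ["convert", "-density", str(dpi), f"{pdf}[{page}]",
            "-depth", "8", str(tiff)]


def tesseract_cmd(tiff: Path, out_base: Path) -> list[str]:
    """Build the Tesseract command; it appends ".txt" to out_base itself."""
    return ["tesseract", str(tiff), str(out_base)]


def ocr_pdf(pdf: Path, page_count: int, out_dir: Path) -> list[Path]:
    """OCR every page of `pdf` into its own text file under `out_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    texts = []
    for page in range(page_count):
        name = f"{pdf.stem}-{page + 1:04d}"  # hypothetical naming scheme
        tiff = out_dir / f"{name}.tiff"
        subprocess.run(convert_cmd(pdf, page, tiff), check=True)
        subprocess.run(tesseract_cmd(tiff, out_dir / name), check=True)
        tiff.unlink()  # throw away the huge intermediate TIFF
        texts.append(out_dir / f"{name}.txt")
    return texts
```

Something like `ocr_pdf(Path("book.pdf"), 250, Path("book-text"))` would then leave one `.txt` per page in `book-text/`, which you can commit to the git repo and search with `ag` or `grep -rn`. One file per page means the file name of a hit tells you which PDF page to open.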