| HN Mirror



	by durnygbur 2270 days ago
	I use ImageMagick and Tesseract to OCR each page of a PDF into a separate text file (going through the TIFF image format; discard the huge TIFFs afterwards), private git repos for hosting, then ag/grep for searching. It's not as easy to locate the phrase back in the PDF as with e.g. Google Books, but then Google Books, with its copyright-related content restrictions, is useless most of the time.
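
A rough sketch of that pipeline in Python, in case it helps anyone. The comment gives no exact commands, so everything here is an assumption: the `convert` binary name (ImageMagick 7 calls it `magick`), the 300 DPI setting, the `page-%04d` naming scheme, and the helper names `convert_cmd`, `tesseract_cmd`, `ocr_pdf` and `search` are all made up for illustration. Because each page ends up in its own text file, the file name tells you which PDF page a hit came from, which softens the "hard to find the phrase back in the PDF" problem a little.

```python
import subprocess
from pathlib import Path


def convert_cmd(pdf: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """Build an ImageMagick command that renders each PDF page to its own TIFF."""
    # The %04d pattern makes ImageMagick write one numbered file per page,
    # starting at 0. The DPI and flattening flags are assumptions.
    return [
        "convert", "-density", str(dpi), str(pdf),
        "-background", "white", "-alpha", "remove",
        str(out_dir / "page-%04d.tif"),
    ]


def tesseract_cmd(tiff: Path) -> list[str]:
    """Build a Tesseract command; Tesseract appends '.txt' to the output base."""
    return ["tesseract", str(tiff), str(tiff.with_suffix(""))]


def ocr_pdf(pdf: Path, out_dir: Path) -> None:
    """OCR every page of `pdf` into out_dir/page-NNNN.txt."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(convert_cmd(pdf, out_dir), check=True)
    for tiff in sorted(out_dir.glob("page-*.tif")):
        subprocess.run(tesseract_cmd(tiff), check=True)
        tiff.unlink()  # discard the huge TIFF once its text file exists


def search(out_dir: Path, phrase: str) -> list[int]:
    """Return 1-based PDF page numbers whose OCR text contains `phrase`."""
    needle = phrase.lower()
    hits = []
    for txt in sorted(out_dir.glob("page-*.txt")):
        text = txt.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            # File numbering starts at 0, PDF pages at 1.
            hits.append(int(txt.stem.split("-")[1]) + 1)
    return hits
```

After `ocr_pdf(Path("book.pdf"), Path("book"))` (hypothetical file names), the `book/` directory holds only small text files, so it can be committed to a private git repo and searched with ag/grep as usual; `search` is just a stand-in for that step that also maps hits back to page numbers.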