| HN Mirror



	by AlphaGeekZulu 985 days ago
	Note: As far as I know, Calibre does not do OCR, so a PDF that contains only scanned images will not work.

2 comments

ggpsv 985 days ago

I've had good luck using Tesseract [0] on scanned PDFs, including ones downloaded from archive.org. If you're not CLI-inclined, there are several GUIs available for it [1].

I did not know Calibre could do this - I was relying on opening each PDF and searching it individually.

[0]: https://github.com/tesseract-ocr/tesseract
[1]: https://www.opait.com/tessstudio/


kristofferR 985 days ago

OCRmyPDF is a tool built on Tesseract and designed specifically for PDFs. I would recommend it over plain Tesseract.

https://github.com/ocrmypdf/OCRmyPDF
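
For anyone who wants to batch this, here is a rough sketch of running OCRmyPDF over a folder of scanned PDFs from Python. It assumes the `ocrmypdf` command is installed and on your PATH. The function names and folder layout are made up for illustration, and `--skip-text` is used on the assumption that you want to leave pages that already have a text layer alone.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Return the argument list for one OCRmyPDF run (hypothetical helper)."""
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language code, e.g. "eng" or "deu"
        "--skip-text",           # do not OCR pages that already contain text
        str(src),
        str(dst),
    ]


def ocr_folder(in_dir: Path, out_dir: Path, language: str = "eng") -> list[Path]:
    """OCR every *.pdf in in_dir and write searchable copies to out_dir."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        # check=True raises CalledProcessError if OCRmyPDF exits with an error
        subprocess.run(build_ocrmypdf_command(src, dst, language), check=True)
        written.append(dst)
    return written
```

Something like `ocr_folder(Path("scans"), Path("searchable"))` should then leave you with PDFs that have a text layer, which is what a full-text indexer needs.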


kristofferR 985 days ago

I recommend running any such PDFs through OCRmyPDF.

https://github.com/ocrmypdf/OCRmyPDF
