by AlphaGeekZulu 937 days ago
Note: As far as I know, Calibre does not do OCR, so a PDF with only scanned content will not work.
2 comments

I've had good luck using Tesseract [0] for scanned PDFs, including ones downloaded from archive.org. If you're not CLI-inclined, there are several GUIs available for it [1].

Did not know about Calibre for this. I was relying on opening each PDF and searching it individually.

[0]: https://github.com/tesseract-ocr/tesseract
[1]: https://www.opait.com/tessstudio/

OCRmyPDF is a tool built on Tesseract and designed specifically for PDFs. I would recommend it over plain Tesseract.

https://github.com/ocrmypdf/OCRmyPDF
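
For anyone who wants to batch this over a folder of scans, here is a rough sketch in Python that just shells out to the ocrmypdf command. It assumes ocrmypdf is installed and on your PATH. The function names and folder layout are made up for illustration. The --skip-text and -l options are OCRmyPDF's own flags for skipping pages that already have text and for picking the Tesseract language.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Return the ocrmypdf command line for one file (nothing is executed here)."""
    return [
        "ocrmypdf",
        "--skip-text",   # leave pages that already have a text layer untouched
        "-l", language,  # Tesseract language code, e.g. "eng" or "deu"
        str(src),
        str(dst),
    ]


def ocr_folder(in_dir: Path, out_dir: Path, language: str = "eng") -> list[Path]:
    """Run ocrmypdf on every PDF in in_dir; return the outputs that succeeded."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        result = subprocess.run(build_ocr_command(src, dst, language))
        if result.returncode == 0:  # 0 means ocrmypdf reported success
            written.append(dst)
    return written
```

Calling ocr_folder(Path("scans"), Path("searchable")) would write an OCRed copy of each PDF into the second folder and leave the originals alone. Both folder names are placeholders.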

I recommend running any such PDFs through OCRmyPDF.

https://github.com/ocrmypdf/OCRmyPDF