by hackeyed 3018 days ago
The process generally splits into three stages: image processing, OCR, and a final step that combines and compresses the pages into a single "bound" pdf/djvu file, at least if you are looking to use FOSS software.

For image processing, take a look at Scantailor (https://github.com/scantailor/scantailor/wiki), which will handle all the image processing steps for you and output images that are ideal for OCR.

I have not done OCR on mixed-language text, but I will say that tesseract has been under active development for years and continues to improve.
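
For what it's worth, tesseract's command line accepts several languages joined with `+` in its `-l` option, so a mixed-language run could look roughly like the sketch below. This is only an outline under assumptions: the language codes (`eng`, `deu`), the `out/` directory, the `*.tif` pattern, and the helper names are made up for illustration, and the matching traineddata files have to be installed. It asks for hOCR output so that each page gets a text layer file next to its image.

```python
import subprocess
from pathlib import Path


def build_tesseract_cmd(image, langs, config="hocr"):
    """Build the tesseract command line for one page image.

    `langs` is a list of tesseract language codes (assumed to be installed);
    they are joined with '+' so tesseract considers all of them on the page.
    """
    image = Path(image)
    # tesseract takes an output *base* name and appends the extension itself
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", "+".join(langs), config]


def ocr_directory(directory, langs, pattern="*.tif"):
    """Run tesseract on every matching image in `directory` (hypothetical layout)."""
    for image in sorted(Path(directory).glob(pattern)):
        # check=True raises if tesseract exits with an error for a page
        subprocess.run(build_tesseract_cmd(image, langs), check=True)


if __name__ == "__main__":
    # Assumed example: processed pages in ./out, text in English and German
    ocr_directory("out", ["eng", "deu"])
```

Keeping the command construction separate from the loop makes it easy to print the commands first and check them before running OCR over a whole book.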

The best FOSS options for binding all the processed images and OCR output into single files are djvubind (https://github.com/strider1551/djvubind) for djvu output, and pdfbeads (https://github.com/ifad/pdfbeads) for pdf output. I tried to write up an outline of the whole process and how to use each of the tools here: https://github.com/wikey/bookscan

A lot of those tools have received little development in the past couple of years. They tend to do what they do well and reliably, so don't let that put you off, though anyone interested in adding to the developer pool would certainly be welcome.

For more general information and especially background discussion, take a look through the DIY Book Scanner forum: https://forum.diybookscanner.org/