by philipkglass 2302 days ago
14 years ago I used the personal edition of Abbyy FineReader to OCR about 400,000 scanned journal articles. It took me a few months.

The workflow was:

- Extract the page images as TIFF, and store the page ranges so I could map the page ranges back to the individual articles afterward.

- Concatenate a range of images into one big file, with an upper limit of (IIRC) about 4000 pages. FR would start to generate weird errors when I made the files any bigger than this.

- Run OCR over the giant 4000 page file.

- Export the result as one big PDF with an OCR text layer under the scanned pages.

- Split the PDF back into individual PDF files corresponding to articles, using the page ranges I saved in the first step.

- Optimize the individual PDF article files for compact storage, using the Multivalent [1] optimizer.
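
The bookkeeping in this workflow (packing whole articles into combined files under the page limit, then recovering each article's page range for the split) can be sketched roughly as below. This is not the original script, which isn't shown here; `Placement`, `plan_batches`, `pdftk_split_args`, and the file names are hypothetical, and the 4000-page cap is only the approximate figure recalled above. The one external detail relied on is PDFtk's `cat` operation with a `first-last` page range.

```python
from dataclasses import dataclass

MAX_PAGES = 4000  # approximate per-file limit recalled above; an assumption


@dataclass
class Placement:
    article_id: str
    batch: int        # index of the combined file the article lands in
    first_page: int   # 1-based, inclusive, within the combined file
    last_page: int    # 1-based, inclusive, within the combined file


def plan_batches(articles, max_pages=MAX_PAGES):
    """Assign consecutive articles to combined files without splitting any article.

    `articles` is an iterable of (article_id, page_count) pairs.
    """
    placements = []
    batch, used = 0, 0
    for article_id, pages in articles:
        if pages < 1:
            raise ValueError(f"{article_id}: page count must be positive")
        if pages > max_pages:
            raise ValueError(f"{article_id}: {pages} pages exceeds the batch limit")
        # Start a new combined file if this article would overflow the current one.
        if used and used + pages > max_pages:
            batch += 1
            used = 0
        placements.append(Placement(article_id, batch, used + 1, used + pages))
        used += pages
    return placements


def pdftk_split_args(placement, batch_pdf, out_pdf):
    """Build the argument list that extracts one article from a combined PDF."""
    page_range = f"{placement.first_page}-{placement.last_page}"
    return ["pdftk", batch_pdf, "cat", page_range, "output", out_pdf]
```

Saving the `Placement` records before running OCR is what makes the later split mechanical: each record can be turned into one `pdftk` invocation (for example via `subprocess.run`) against the exported batch PDF.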

I did this with a combination of FineReader -- the only paid software -- Python, Multivalent, AutoHotKey, and PDFtk.

I was living on a grad student stipend at the time so I optimized for spending the least amount of cash possible, at the cost of writing my own automation to replace the batch processing found in more expensive editions of FineReader.

The most time-consuming part was dealing with weird one-off errors thrown by FR's OCR engine. I had to resolve them all manually. They were too varied and infrequent to be worth automating away.

I tried Acrobat's own OCR too before I resorted to FineReader, but it was pretty terrible. At the time it also appeared to make the PDF files significantly larger, which was weird since a text layer shouldn't take much additional storage.

[1] http://multivalent.sourceforge.net/