by anodyne33
2302 days ago
Does anybody regularly use Acrobat's text extraction engine? I've had fine results as far as accuracy goes when compared to other OCR engines, but one sticking point drives me nuts. I'm typically doing this in batches of thousands of files, and if a PDF has a footer applied, Acrobat sees that as renderable text and blows off the rest of the page. I've tried all manner of sanitizing, removing hidden information, and saving as another PDF protocol, and I still can't get around the plain text footers/headers. In a perfect world I'd have unlimited Tesseract or ABBYY access, but we're trying to do this on the cheap and I'm working with client data that I don't want to bang through Google. I'll have to poke at some of the open source tools mentioned so far, too.
The workflow was:
- Extract the page images as TIFF, and store the page ranges so I could map the page ranges back to the individual articles afterward.
- Concatenate a range of images into one big file, with an upper limit of (IIRC) about 4000 pages. FR would start to generate weird errors when I made the files any bigger than this.
- Run OCR over the giant 4000 page file.
- Export the result as one big PDF with an OCR text layer under the scanned pages.
- Split the PDF back into individual PDF files corresponding to articles, using the data I saved in step 1.
- Optimize the individual PDF article files for compact storage, using the Multivalent [1] optimizer.
I did this with a combination of FineReader -- the only paid software -- Python, Multivalent, AutoHotKey, and PDFtk.
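For anyone wanting to reproduce the bookkeeping part, here is a minimal sketch of how the page ranges could be tracked. It is not my original script: `plan_batches`, `Placement`, `pdftk_split_args`, and the file names are hypothetical, the 4000 page ceiling is just the rough figure from above, and I'm assuming an article is never split across two batches. The split step only builds a PDFtk `cat` command line; it does not run anything.

```python
from dataclasses import dataclass

MAX_PAGES = 4000  # rough batch ceiling mentioned above; tune for your OCR tool


@dataclass
class Placement:
    article: str  # article identifier (hypothetical naming)
    batch: int    # index of the big concatenated file
    first: int    # 1-based first page inside that batch
    last: int     # 1-based last page inside that batch, inclusive


def plan_batches(page_counts, max_pages=MAX_PAGES):
    """Pack consecutive articles into batches of at most max_pages pages.

    page_counts is a list of (article_name, number_of_pages) pairs.
    Assumption: an article always stays whole inside a single batch.
    """
    placements = []
    batch, used = 0, 0
    for name, pages in page_counts:
        if pages < 1 or pages > max_pages:
            raise ValueError(f"{name}: cannot place {pages} pages")
        if used and used + pages > max_pages:
            # This article would overflow the current batch; start a new one.
            batch += 1
            used = 0
        placements.append(Placement(name, batch, used + 1, used + pages))
        used += pages
    return placements


def pdftk_split_args(placement, batch_pdf, out_pdf):
    """Build (but do not run) a PDFtk command that cuts one article out."""
    page_range = f"{placement.first}-{placement.last}"
    return ["pdftk", batch_pdf, "cat", page_range, "output", out_pdf]


if __name__ == "__main__":
    plan = plan_batches([("a", 2500), ("b", 1000), ("c", 600), ("d", 10)])
    for p in plan:
        args = pdftk_split_args(p, f"batch{p.batch}.pdf", f"{p.article}.pdf")
        print(" ".join(args))
```

Saving the placements before OCR is what makes the split step mechanical: each article maps to exactly one batch file and one inclusive page range, which could be handed to `subprocess.run` once the OCR'd batch PDFs exist.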
I was living on a grad student stipend at the time so I optimized for spending the least amount of cash possible, at the cost of writing my own automation to replace the batch processing found in more expensive editions of FineReader.
The most time consuming part was dealing with weird one-off errors thrown by FR's OCR engine. I had to resolve them all manually. They were too varied and infrequent to be worth automating away.
I tried Acrobat's own OCR too before I resorted to FineReader, but it was pretty terrible. At the time it also appeared to make the PDF files significantly larger, which was weird since a text layer shouldn't take much additional storage.
[1] http://multivalent.sourceforge.net/