| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by MassPikeMike 487 days ago

I maintain a searchable archive of historical documents for a nonprofit, OCR'd with Tesseract over several years. Tesseract 4 was a big improvement over previous versions, but since then its accuracy has not improved at the same rate as other free solutions.

These days, just uploading a PDF of scanned documents (typeset ones, not handwriting) to Google Drive and opening with Google Docs results in a text document generated with impressive quality OCR.

But this is not scriptable, and doesn't provide access position information, which is needed so we can highlight search results as color overlays on the original PDF. Tesseract's hOCR mode was great for that.

For the next version, we're planning to use one of the command-line wrappers to Apple's Vision framework, which is included free in MacOS. A nice one that provides position information is at https://github.com/bytefer/macos-vision-ocr

1 comments

thenthenthen 487 days ago

I asked this question yesterday but did not enough votes. I need to OCR and then translate thousands of pages from historical documents and was wondering if you knew a scriptable app/technique or technology that includes ‘layout recovery’, aka overlaying translated text over the original, like the Safari browser etc. does (not sure the apple vision framework wrapper does this?).

link

MassPikeMike 487 days ago

Apple Vision and its wrappers provide bounding boxes for each line of text. That's slightly less convenient than Tesseract which can give you a bounding box for each word, but more than compensated by Apple Vision's better accuracy. I am planning to fudge the word boxes by assuming fixed-width letters and dividing up the overall width so that each word's width is proportional to its share of the total letters on the line.

Once you have those bounding boxes, it's pretty simple to use a library like [1] (Python) or [2] (JavaScript) to add overlay text in the right place. For example, see how [3] does it.

[1] https://pymupdf.readthedocs.io/en/latest/recipes-text.html#h... [2] https://github.com/foliojs/pdfkit [3] https://github.com/eloops/hocr2pdf

link

thenthenthen 486 days ago

Thank you, I will take a look at hocr2pdf to see how they overlay text!

link

wahnfrieden 487 days ago

FYI the Apple one is best inside the Live Text API which is Swift-only and so some old Python and CLI tools which wrap the older Obj-C APIs may have worse quality (though Live Text doesn't really provide bounding boxes - so what I do is combine its output with bounding box APIs like the old iOS/macOS ones)

link