|
|
|
|
|
by MassPikeMike
487 days ago
|
|
I maintain a searchable archive of historical documents for a nonprofit, OCR'd with Tesseract over several years. Tesseract 4 was a big improvement over previous versions, but since then its accuracy has not improved at the same rate as other free solutions. These days, just uploading a PDF of scanned documents (typeset ones, not handwriting) to Google Drive and opening with Google Docs results in a text document generated with impressive quality OCR. But this is not scriptable, and doesn't provide access position information, which is needed so we can highlight search results as color overlays on the original PDF. Tesseract's hOCR mode was great for that. For the next version, we're planning to use one of the command-line wrappers to Apple's Vision framework, which is included free in MacOS. A nice one that provides position information is at https://github.com/bytefer/macos-vision-ocr |
|