|
|
|
|
|
by amelius
486 days ago
|
|
I didn't have good results in tesseract, so I hope this is really different ;) I was surprised that even scraped screen text did not work 100% flawlessly in tesseract. Maybe it was not made for that, but still, I had a lot of problems with high resolution photos also. I did not try scanned documents, though. |
|
In the mid 2010s I put Tesseract, OCRad (which is decidedly not state of the art), and aspell into a pretty effective text processing pipeline to transform resumes into structured documents. The commercial solutions we looked at (at the time) were a little slower and about as good. If the spellcheck came back with too low of a success rate I ran the document through OCRad which, while simplistic, sometimes did a better job.
I expect the results today with more modern projects to be much better so I probably wouldn’t go that path again. However as all of it runs nicely on slow hardware, it likely still has a place on low power/hobby grade IoT boards and other niches.