| HN Mirror



	by mometsi 530 days ago
	> How is this different from tesseract and friends? The workflow is for digitizing historical printed documents. Think conserving old announcements in blackletter typesetting, not extracting info from typewritten business documents.

2 comments

amelius 530 days ago

I didn't have good results with Tesseract, so I hope this is really different ;)

I was surprised that even text scraped from screenshots did not come out 100% flawlessly in Tesseract. Maybe it was not made for that, but still, I also had a lot of problems with high-resolution photos. I did not try scanned documents, though.


Moto7451 530 days ago

I have never had to handle handwriting professionally, but I have had great success with Tesseract in the past. I’m sure it’s no longer the best free/cheap option, but with a little bit of image pre-processing to ensure the text pops from the background and isn’t unnecessarily large (i.e. that 1200 dpi scan is overkill), you can have a pretty nice pipeline with good results.
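For anyone wondering what that kind of pre-processing might look like, here is a rough pure-Python sketch, not the pipeline described above. It assumes an 8-bit grayscale image held as a flat list of pixel values, uses Otsu's method as one possible way to separate text from background, and picks 300 dpi as an assumed target resolution. The function names are made up for illustration.

```python
def downscale_size(width, height, scan_dpi, target_dpi=300):
    """Pixel size after shrinking a scan to target_dpi (assumed target: 300)."""
    if scan_dpi <= target_dpi:
        return width, height  # never upscale
    factor = target_dpi / scan_dpi
    return round(width * factor), round(height * factor)


def otsu_threshold(pixels):
    """Gray level (0-255) that best splits dark text from light background."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no pixels at or below t yet
        w_light = total - w_dark
        if w_light == 0:
            break  # everything is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance, up to a constant factor.
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels):
    """Map every pixel to pure black (0) or pure white (255)."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]
```

In practice an imaging library would do the actual resizing and thresholding; the point is only that the image gets smaller and the text ends up pure black on pure white before it reaches the OCR engine.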

In the mid 2010s I put Tesseract, OCRad (which is decidedly not state of the art), and aspell into a pretty effective text processing pipeline to transform resumes into structured documents. The commercial solutions we looked at (at the time) were a little slower and about as good. If the spellcheck came back with too low of a success rate I ran the document through OCRad which, while simplistic, sometimes did a better job.
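The fallback logic could be sketched roughly like this. It is an illustration, not the original code: `primary` and `fallback` are hypothetical callables standing in for Tesseract and OCRad, a plain set of words stands in for aspell, the 0.8 cutoff is an arbitrary assumption, and keeping whichever output scores higher is also an assumption about how the two results were compared.

```python
import re


def spell_success_rate(text, dictionary):
    """Fraction of alphabetic words in text that appear in dictionary."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


def ocr_with_fallback(image, primary, fallback, dictionary, min_rate=0.8):
    """Run primary OCR; if too few words pass the spellcheck, try fallback.

    primary and fallback are callables taking an image and returning text.
    Returns (text, rate) for whichever engine's output scored higher.
    """
    text = primary(image)
    rate = spell_success_rate(text, dictionary)
    if rate >= min_rate:
        return text, rate  # primary result is good enough
    alt_text = fallback(image)
    alt_rate = spell_success_rate(alt_text, dictionary)
    if alt_rate > rate:
        return alt_text, alt_rate
    return text, rate
```

Passing the engines in as callables keeps the sketch independent of any particular OCR binding; each one could just as well shell out to a command-line tool.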

I expect the results today with more modern projects to be much better so I probably wouldn’t go that path again. However as all of it runs nicely on slow hardware, it likely still has a place on low power/hobby grade IoT boards and other niches.


spigottoday 530 days ago

I have a typewritten manuscript that is interspersed with handwritten edits. Tesseract worked fine until the handwritten parts, then garbage. Is there a local solution that anyone can recommend? I have a 16 GB Lenovo laptop and access to a workstation with an RTX 4070 Ti 16 GB card. Thanks.


bonefolder 530 days ago

Tangentially related, but does anyone know a resource for high-quality scans of documents in blackletter / Fraktur typesetting? I'm trying to make documents look fraktury in LaTeX and would like any and all documents I can lay my hands on.
