Hacker News
by asveikau 66 days ago
What OCR do you guys use? I have only seen OCR that makes a lot of errors. Making it usable requires tons of manual review. I probably wouldn't trust an LLM to do that review because it may introduce its own errors.

Edit: downvoters, would you like to answer my question? I would genuinely like to know. I thought based on the confidence of the comment above there must be a super accurate OCR I've never heard of, but after seeing the sibling comment I'm going to guess there isn't.

2 comments

Stirling PDF (https://github.com/Stirling-Tools/Stirling-PDF) is a free, self-hosted PDF tool that can do very accurate OCR while keeping the formatting.
Modern OCR is VERY accurate. Heck, Adobe Acrobat Pro OCR was essentially perfect 20 years ago.
One of my hobbies is typesetting modern editions of a certain type of rare, obscure old books that were poorly typeset to begin with. Modern OCR (and I've tried plenty of tools) is still rather error-prone in my application.
Can you name a good open source one? I have spent many hours in the current decade correcting OCR errors. Mostly Tesseract.