by cvz 489 days ago
As someone who's learning how to do OCR in order to re-OCR a bunch of poorly digitized documents, this will not work with modern OCR. Modern OCR is too good.

If you're able to improve the preprocessing and recognition enough, then there's a point at which any post-processing step you do will introduce more errors than it fixes. LLMs are particularly bad as a post-processing step because the errors they introduce are _designed to be plausible_ even when they don't match the original text. This means they can't be caught just by reading the OCR results.

I've only learned this recently, but it's something OCR experts have known for over a decade, including the maintainers of Tesseract. [1]

OCR is already at the point where adding an LLM at the end is counterproductive. The state of the art now is to use an LSTM (also a type of neural network) which directly recognizes the text from the image. This performs shockingly well if trained properly. When it does fail, it fails in ways not easily corrected by LLMs. I've OCR'ed entire pages using Tesseract's new LSTM engine where the only errors were in numbers and abbreviations, which an LLM obviously can't fix.

[1] https://tesseract-ocr.github.io/docs/Limits_on_the_Applicati...


> As someone ... Modern OCR is too good

I also have extensive, and even recent, experience: I get an important amount of avoidable errors.

> at which any post-processing step you do will introduce more errors than it fixes ... the errors they [(LLMs)] introduce are _designed to be plausible_

You are thinking of a fully automated process, not of the human verification through `diff ocr_output llm_corrected`. And even then, given that I can notice errors that an algorithm with some language proficiency could certainly correct, I have reason to suppose that a properly calibrated LLM-based system can act on a large number of True Positives with a negligible number of False Positives.
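
A minimal sketch of that kind of review step, assuming the OCR output and the LLM-corrected text are both available as plain strings. It uses Python's standard `difflib` rather than the `diff` command, works at word level, and lists only the spans the corrector touched so a human can accept or reject each one. The function name `changed_spans` and the sample strings are made up for illustration.

```python
import difflib

def changed_spans(ocr_text: str, corrected_text: str):
    """Return (tag, ocr_words, corrected_words) for every span that differs."""
    ocr_words = ocr_text.split()
    new_words = corrected_text.split()
    # autojunk=False keeps frequent words from being treated as junk on long texts
    matcher = difflib.SequenceMatcher(a=ocr_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete' or 'insert'
            changes.append((tag, ocr_words[i1:i2], new_words[j1:j2]))
    return changes

# Hypothetical sample strings, not taken from any real scan
ocr_output = "The rnatter was settled in 1S45"
llm_corrected = "The matter was settled in 1845"

for tag, before, after in changed_spans(ocr_output, llm_corrected):
    print(f"{tag}: {' '.join(before)!r} -> {' '.join(after)!r}")
```

Every proposed correction shows up as one entry, so a False Positive is visible to the reviewer, provided the reviewer checks it against the page image and not just against their own sense of what reads well.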

> LSTM

I am using LSTM-based engines, and on those outputs I have stated «I get an important amount of avoidable errors». The one thing that could go in your direction is that I am not using the latest version of `tesseract` (though still in the 4.x series), and I have recently noticed (just from `tesseract --print-parameters | grep lstm`) that the LSTM engine evolved within 4.x, from the early releases to the later ones.

> numbers and abbreviations which an LLM obviously can't fix

? It's the opposite: for the numbers it could go (implicitly) "are you sure, I have a different figure for that" and for abbreviations, the LLM is exactly the thing that should guess them out of the context. The LLM is the thing that knows that "the one defeated by Cromwell should really be Charles II-staintoberemoved instead of an apparent Charles III".

> You are thinking of a fully automated process, not of the human verification through `diff ocr_output llm_corrected`.

Fair, and I'm aware that that makes a huge difference in how worthwhile an LLM is. I'm glad you're not doing the annoyingly common "just throw AI at it" without thinking through the consequences.

I'm doing two things to flag words for human review: checking the confidence score of the classifier, and checking words against a dictionary. I didn't even consider using an LLM for that since the existing process catches just about everything that's possible to catch.
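
A rough sketch of those two checks, assuming each recognized word arrives with a confidence on a 0-100 scale (the kind of per-word value Tesseract's TSV output reports). The `Word` type, the `flag_for_review` name, the threshold of 90 and the tiny dictionary are all placeholders chosen for illustration, not the setup described above.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple

class Word(NamedTuple):
    text: str
    confidence: float  # assumed to be on a 0-100 scale

def flag_for_review(
    words: Iterable[Word],
    dictionary: Set[str],
    min_confidence: float = 90.0,  # arbitrary threshold; tune per corpus
) -> List[Tuple[str, List[str]]]:
    """Return (word, reasons) for every word a human should look at."""
    flagged = []
    for w in words:
        reasons = []
        if w.confidence < min_confidence:
            reasons.append("low confidence")
        # Only purely alphabetic tokens are checked against the dictionary
        token = w.text.strip(".,;:!?\"'()").lower()
        if token.isalpha() and token not in dictionary:
            reasons.append("not in dictionary")
        if reasons:
            flagged.append((w.text, reasons))
    return flagged

# Hypothetical recognizer output
words = [
    Word("The", 96.0), Word("rnatter", 93.0), Word("was", 97.0),
    Word("settled", 71.0), Word("in", 95.0), Word("1845.", 92.0),
]
dictionary = {"the", "matter", "was", "settled", "in"}

for text, reasons in flag_for_review(words, dictionary):
    print(text, reasons)
```

Note that a number such as "1845." is never checked against the dictionary, so a misread digit is only flagged when the recognizer itself reports low confidence.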

> I am using LSTM-based engines . . .

I'm using Tesseract 5.5. It could actually be that much better, or I could just be lucky. I've got some pretty well-done scans to work with.

> It's the opposite: for the numbers it could go (implicitly) "are you sure, I have a different figure for that" . . .

I honestly don't know what you mean. Are you saying that an LLM would know that a reference to "page 311" should actually be a reference to "page 317" based on context?

I think the example you've given makes a lot of sense if you're just using an LLM as one way to flag things for review.

> I honestly don't know what you mean. Are you saying that an LLM would know that a reference to "page 311" should actually be a reference to "page 317" based on context

Not the one in your example (the page number): that would be pretty hard to check with current general agents. In the future it is not impossible, but you would need an agent finally capable of following procedures strictly. That the extra punctuation or accent is a glitch, or that the sentence contains a mistake, is more within the realm of a Language Model.

What I am saying is that a good Specialized Language Model (maybe a good less efficient LLM) could fix a text like:

"AB〈G〉 [Should be 'ABC'!] News was founded in 194〈S〉 [Should be '1945'!] after action from the 〈P〉CC [Should be 'FCC'!], 〈_〉 [Noise!], deman〈ci〉ing [Should be 'demanding'!] pluralist progress 〈8〉 [Should be '3'!] 〈v〉ears [should be 'years'] ear〈!〉ier [Should be 'earlier'!]..."

since it should "understand" the sentence and be already informed of the facts.

This is moot anyway if the LLM is only used as part of a review process. But the most valuable documents to digitize are, almost by definition, those that don't have widely-known information that an LLM is statistically likely to guess. There's no way to get around that.

> This is moot anyway if the LLM is only used as part of a review process

Not really, because if you plan to perform a `diff` in order to ensure that there are no False Positives in the corrections proposed by your "assistant", you will want one that finds as many True Positives as possible (otherwise the exercise will be futile and inefficient, since a large number of OCR errors will remain). So, it would be good to have some tool that could (in theory, at this stage) find the subtle ones (not dictionary related, not local etc.).

> But the most valuable documents to digitize are, almost by definition, those that don't have widely-known information that an LLM is statistically likely to guess

It depends on the text. You may, e.g., be interested in texts whose value lies in their arguments, in the development of thought (they will speak about known things in a novel way), where OCR errors have a much reduced scope (and yet you want them clean of local noise). And if a report comes out from Gallup or similar, with original figures coming from recent research, we can hope that it will be distributed primarily electronically. Potential helpers today could do more than those of only a few years ago (e.g. hunspell).

I tried it! (How very distracted of us not to have tried immediately.) And it works, mostly... I used a public LLM.

The sentence:

> ABG News was founded in 194S after action from the PCC , _ , demanciing pluralist progress 8 vears ear!ier

is corrected as follows, initially:

> ABG News was founded in 1945 after action from the FCC, demanding pluralist progress 8 years earlier

and as you can see it already corrects a number of trivial and non-trivial OCR errors, including recognizing (explicitly in the full response) that it should be the "FCC" to "demand pluralist progress" (not to mention, all of the wrong characters breaking unambiguous words and the extra punctuation).

After a second request, to review its output to "see if the work is complete", it sees that 'ABG', which it was reluctant to correct because of ambiguities ("Australian Broadcasting Corporation", "Autonomous Bougainville Government" etc.), should actually be 'ABC' because of several hints from the rest of the sentence.

As you can see - proof of concept - it works. The one error it (the one I tried) cannot see is that of '8' instead of '3': it knows (it says so in the full response) that the FCC acted in 1942 and that ABC was created in 1945, but it does not compute the difference internally, nor does it catch the hint that '8' and '3' are graphically close. Maybe an LLM with "explicit reasoning" could do even better and catch that too.
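
The missing step can be written out explicitly. The sketch below takes the two years mentioned above (1942 and 1945), computes the interval, and asks whether the OCR'd digit is a plausible misreading of it. The `CONFUSABLE` table and the `check_interval` function are hypothetical, made up for this example; a real confusion table would have to come from the recognizer's actual error statistics.

```python
# Hypothetical table of digits that look alike in print; not from any library
CONFUSABLE = {
    "8": ["3", "6", "0"],
    "3": ["8"],
    "1": ["7"],
    "7": ["1"],
    "5": ["6"],
}

def check_interval(founded_year: int, action_year: int, ocr_digit: str):
    """Compare the OCR'd interval with the one implied by the two years."""
    expected = str(founded_year - action_year)
    if ocr_digit == expected:
        return ("consistent", ocr_digit)
    # The OCR'd digit disagrees: is the expected digit a known look-alike?
    if expected in CONFUSABLE.get(ocr_digit, []):
        return ("likely misread", expected)
    return ("inconsistent", None)

print(check_interval(1945, 1942, "8"))  # ('likely misread', '3')
```

This only works because the sentence happens to contain both years; it says nothing about figures that cannot be cross-checked against anything else in the text.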

I'm saying it's moot because, if you're just flagging things for review, there's already a more direct and reliable way to do that. The OCR classifier itself outputs a confidence score. The naive way of just checking that confidence score will work. The OCR classifier has less overall information than an LLM, but the information it has is much more relevant to the task it's doing.

When I have some time in front of a computer, I'll try a side-by-side comparison with some actual images.

> OCR is already at the point where adding an LLM at the end is counterproductive

That's mass OCR on printed documents. On handwritten documents, LLMs help. There are tons of documents that even top human experts can't read without context and domain language. Printed documents are intended to be readable character by character. Often the only thing a handwriting author intends is to remind himself of what he was thinking when he wrote it.

Also, what are the downstream tasks? What do you need character-level accuracy for? In my experience, it's often for indexing and search. I believe LLMs have a higher ceiling there, and can in principle (if not in practice, yet) find data and answer questions about a text better than straightforward indexing or search can. I can't count the number of times I've e.g. missed a child in genealogy because I didn't think of searching the (fully and usually correctly) indexed data for some spelling or naming variant.

I am working with printed documents. Maybe LLMs currently make a difference with handwriting recognition. I wasn't directly responding to that. It's outside the little bit that I know, and I didn't even think of it as "OCR".

I'm not saying that I need high accuracy (though I do), I'm saying that the current accuracy (and clarifying that this is specifically for printed text) is already very high. Part of the reason it's so high is that the old complicated character-by-character classifiers have already been replaced with neural networks that process entire lines at a time. It's already moving in the direction you're saying we need.