| > As someone ... Modern OCR is too good I also have extensive and recent experience: I get a significant number of avoidable errors. > at which any post-processing step you do will introduce more errors than it fixes ... the errors they [(LLMs)] introduce are _designed to be plausible_ You are thinking of a fully automated process, not of human verification through `diff ocr_output llm_corrected`. And even then, given that I can notice errors that an algorithm with some language proficiency could certainly correct, I have reason to suppose that a properly calibrated LLM-based system can act on a large number of True Positives with a negligible number of False Positives. > LSTM I am using LSTM-based engines, and it is about those outputs that I stated «I get a significant number of avoidable errors». The one thing that could go in your direction is that I am not using the latest version of `tesseract` (though still in the 4.x), and I have recently noticed (already through `tesseract --print-parameters | grep lstm`) that the LSTM engine evolved within 4.x, from early to later. > numbers and abbreviations which an LLM obviously can't fix ? It's the opposite: for the numbers it could go (implicitly) "are you sure, I have a different figure for that" and for abbreviations, the LLM is exactly the thing that should guess them out of the context. The LLM is that thing that knows that "the one defeated by Cromwell should really be Charles II-staintoberemoved instead of an apparent Charles III". |
Fair, and I'm aware that that makes a huge difference in how worthwhile an LLM is. I'm glad you're not doing the annoyingly common "just throw AI at it" without thinking through the consequences.
I'm doing two things to flag words for human review: checking the confidence score of the classifier, and checking words against a dictionary. I didn't even consider using an LLM for that since the existing process catches just about everything that's possible to catch.
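For anyone curious, the two checks boil down to something like the sketch below. It is not my actual pipeline: the function name, the threshold of 80, and the tiny word set are made up for illustration, and it assumes you have already pulled (word, confidence) pairs out of the OCR output (Tesseract's TSV output has `conf` and `text` columns you can use for that).

```python
import string

def flag_for_review(words, dictionary, min_conf=80.0):
    """Return (index, word, reasons) for each word a human should look at.

    words      -- list of (text, confidence) pairs, confidence on a 0-100 scale
    dictionary -- set of known lowercase words (hypothetical word list)
    min_conf   -- illustrative threshold, not a recommended value
    """
    flagged = []
    for i, (text, conf) in enumerate(words):
        reasons = []
        if conf < min_conf:
            reasons.append("low confidence")
        # Strip surrounding punctuation so "fox," is looked up as "fox".
        core = text.strip(string.punctuation).lower()
        # Only purely alphabetic tokens are checked against the dictionary.
        if core and core.isalpha() and core not in dictionary:
            reasons.append("not in dictionary")
        if reasons:
            flagged.append((i, text, reasons))
    return flagged

words = [("The", 96.0), ("qnick", 91.0), ("brown", 62.5),
         ("fox,", 95.0), ("page", 93.0), ("311", 94.0)]
dictionary = {"the", "quick", "brown", "fox", "page"}
result = flag_for_review(words, dictionary)
for index, text, reasons in result:
    print(index, text, reasons)
```

Note that "311" passes both checks here: it isn't alphabetic, so the dictionary never sees it, and its confidence is high. A misread digit with a high confidence score goes unflagged in this sketch.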
> I am using LSTM-based engines . . .
I'm using Tesseract 5.5. It could actually be that much better, or I could just be lucky. I've got some pretty well-done scans to work with.
> It's the opposite: for the numbers it could go (implicitly) "are you sure, I have a different figure for that" . . .
I honestly don't know what you mean. Are you saying that an LLM would know that a reference to "page 311" should actually be a reference to "page 317" based on context?
I think the example you've given makes a lot of sense if you're just using an LLM as one way to flag things for review.
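Something like the following is how I'd picture the `diff ocr_output llm_corrected` idea at the word level, with every LLM change treated as a suggestion for a human to accept or reject rather than something applied automatically. It's a rough sketch using Python's standard `difflib`; the function name and the sample strings are invented for illustration.

```python
import difflib

def proposed_changes(ocr_text, corrected_text):
    """List (ocr_words, suggested_words) pairs where the two texts differ."""
    a = ocr_text.split()
    b = corrected_text.split()
    # autojunk=False turns off the popularity heuristic, which can give
    # surprising results on long inputs with many repeated words.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

ocr = "defeated by Cromwell was Charles III in 1651"
llm = "defeated by Cromwell was Charles II in 1651"
changes = proposed_changes(ocr, llm)
for before, after in changes:
    print(f"review: {before!r} -> {after!r}")
```

An empty string on either side of a pair means the LLM inserted or deleted words there, which is probably worth a closer look than a one-word substitution.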